Perspectives in Vision: Conception or Perception?*



M. T. Turvey 



ABSTRACT 

To a very large extent theories of perception assume that the
stimulus at the receptors underdetermines the perceptual experience.
Consequently, the official doctrine holds that perception is predicated
on conception: one must exploit one's knowledge about the
world in order to perceive it. The manifestation of this point of
view in the currently popular information-processing approach to
visual perception is treated at some length. An alternative approach
begins with the argument that the correspondence between the ambient
optic array and the environment, the "what" of visual perception, has
been insufficiently examined and that the enterprise of modeling
perceptual processes, the "how" of perception, is premature. Indeed,
the alternative position suspects that stimulation does not
underdetermine perception and that, contrary to tradition, perception is not
mediated by intellectual processes. This thesis of direct perception
is developed and contrasted with that of indirect perception.

From what cloth shall we cut the theory of visual perception? For most 
scholars over the centuries the answer has been singularly straightforward: con- 
ception. The official doctrine is that our visual perception of the world de- 
pends in very large part on our conception of it. To paraphrase: knowing the 
world perceptually rests on knowing about the world conceptually. The inquiry 
into visual perception is dominated by this thesis of conception as primary, and 
the first charge of this paper is to examine in elementary but reasonably
detailed fashion the manifestations of this thesis in current theory and research.

But there is another and contrasting point of view to which this paper will
turn in due course, one that asserts the primacy of perception. From this
standpoint, information about the visual world is obtained without the 



*This paper was presented as an invited State-of-the-Art address to the World 
Congress on Dyslexia sponsored by the Orton Society and Mayo Clinic, Rochester, 
Minn., 23-25 November 1974. It is to be published in Reading, Perception and
Language, ed. by M. Rawson and D. Duane. (Baltimore, Md.: York).

Also University of Connecticut, Storrs.

Acknowledgment: The preparation of this manuscript was supported in part by a
John Simon Guggenheim Fellowship awarded to the author for the period 1973-1974. 
intervention of conceptual processes. On this view, conceptual knowledge is the
offspring of perception. To declare the independence of perception from
conception is to declare the dependence of perception on stimulation. The second
charge of this paper is to examine the meaning and implications of these
declarations.



INDIRECT PERCEPTION: CONSTRUCTIVISM AND THE PRIMACY OF CONCEPTION
Introduction 

Only relatively superficial differences divide the various theories of 
visual perception with which we are most acquainted. With respect to the basic 
postulates on which each is founded there is an overwhelming consistency of 
opinion: they all concur to a greater or lesser degree that perception begins
with retinal data that relate imperfectly, even ambiguously, to their source:
the world of objects, surfaces, and events.

Consequently, it is customary to liken the task of the perceiver to that of 
a detective who must seek to determine what transpired from the bits and pieces 
of evidence available to him. Extending this metaphor, we have supposed that 
there are "cues" or "clues" to be read from the retinal image, and to a very
large extent the endeavors of students of perception have been directed to
isolating these cues and inquiring about how they are used. The impression is that
a large discrepancy exists between what is given in the retinal image and what
is perceived; like Gregory (1969), we wonder how so little information can
control so much behavior. From necessity we argue that the perceiver must engage
in a large stock of inferential and hypothesis-generating and testing procedures
that rely heavily on memory, that is, on internal models of reality established
in the course of prior experience of both the individual and the species.

This general approach to the theory of visual perception has a long tradi- 
tion. Recall that John Locke drew a distinction between primary qualities as 
given and secondary qualities as inferred, and that Helmholtz coined the phrase 
"unconscious inference" to represent this theoretical persuasion. Currently it 
is given expression through the term "constructivism," which will be used in
this paper for any theory proposing that in order to perceive one must go beyond
what is given in stimulation. Thus, we understand constructivism to mean that
perception cannot be achieved directly from stimulation (one might say that
stimulation underdetermines perceptual experience). On the contrary, perceptual
experience is constructed or created out of a number of ingredients, only some
of which are provided by the data of the senses. Other ingredients in a perception
recipe are provided by our expectations, our biases, and primarily by our
conceptual knowledge about the world. The gist of the constructive interpreta- 
tion is conveyed in the following remark: "...perceptions are constructed by 
complex brain processes from fleeting fragmentary scraps of data signalled by 
the senses and drawn from the brain memory banks — themselves constructions 
from snippets of the past" (Gregory, 1972:707). Helmholtz (1925), in response to 
the ambiguity of retinal stimulation, might have said: we perceive that object 
or that event that would normally fit the proximal stimulus distribution 
(Hochberg, 1974). 

Internal Modeling of the Relation Structure of External Events 

The view of the brain as a complex information-processing mechanism that 
internally models the world is of central importance to theories of the
constructivist persuasion. This view, represented best by Craik (1943), can be
stated explicitly: neural processes or mental operations symbolically mirror
objects, events, and their interrelationships.



The advantage of internal modeling is that outcomes can be predicted.
Mental processes allow us to proceed vicariously through a pattern of [illegible]
a succession of events much as an engineer determines a reliable design for a
bridge before he begins building. In short, we can determine what would
occur without actually performing the test, the outcome of which may be useless
or even harmful. There would seem to be little reason to debate this property 
of mind with respect to thought and language, but can we with the same confidence
regard visual perception analogously as a modeling, imitative, predictive
process? The constructivist answer is an unequivocal "Yes." We may conceptualize
visual perception as the task of creating a short-term model of contemporary
distal stimuli out of, on the one hand, the contemporary but crude proximal
stimulation and, on the other, the internalized long-term model of the world
(cf. Arbib, 1972). This is a significant feature of constructivism: it encourages
us to examine the possibility that psychological processes such as thought,
language, and seeing are more similar than they are different. The kinds of
knowledge (heuristics, algorithms, and so on) that permit thought may not differ
significantly from those that permit language, and these in turn may be
equivalent to those that yield visual perception. Paraphrasing Kolers (1968), Katz
(1971), Sutherland (1973), and others, there is nothing that would suggest to a
constructivist different kinds of intelligence underlying these apparently
different activities.

That brain mechanisms can model the world, in the sense of exhibiting processes
that have a similar relational structure to sets of physical events, is
an intuition that is currently receiving some measure of support in the
laboratory. Echoing Craik's (1943) hypothesis, Shepard (in press; Shepard and Chipman,
1970) proposes a second-order isomorphism between physical objects and their
internal representations. This isomorphism is not in the first-order relation
between a particular object and its internal representation (as Gestalt
psychologists used to have it), but rather in the relation between (1) the relations
among a set of objects and (2) the relations among their corresponding internal
representations. To quote Shepard and Chipman (1970), "Thus, although the
internal representation of a square need not itself be a square it should
(whatever it is) at least have a closer functional relation to the internal
representation of a rectangle than to, say, a green flash or a taste of persimmon."

There are two experiments that speak favorably for the notion of a
second-order isomorphism. In the first (Shepard and Chipman, 1970), fifteen states of
the United States (ones that did not differ too greatly in size) were selected
for similarity judgments. One hundred and five pairings of members of this set
were presented in name form and ranked by the subjects according to similarity.
The same subjects then ranked the same 105 pairs of states presented in picture
form. The structure of similarity relations among the shapes was virtually
identical whether the states compared were there to be perceived (pictorial
presentation) or could only be imagined (name presentation). Moreover, for both
name and pictorial presentation, the similarity judgments corresponded to
identifiable properties of the actual cartographic shapes of the states (see also
Gordon and Hayward, 1973).
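
The logic of comparing two sets of pairwise rankings can be sketched in a few
lines. The following is only an illustrative sketch, not the analysis Shepard
and Chipman report: the function names, the use of a rank correlation as the
index of agreement, and the toy dissimilarity functions are all assumptions
introduced here. It does show why fifteen items yield 105 pairings and how
agreement between two orderings of those pairings might be quantified.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks of `values`, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Plain product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def second_order_agreement(items, dissim_a, dissim_b):
    """Rank-correlate two dissimilarity measures over all unordered pairs.

    `dissim_a` and `dissim_b` are hypothetical stand-ins for, say, judgments
    made from names and judgments made from pictures.
    """
    pairs = list(combinations(items, 2))
    a = [dissim_a(p, q) for p, q in pairs]
    b = [dissim_b(p, q) for p, q in pairs]
    return len(pairs), pearson(ranks(a), ranks(b))


# Toy data (assumed): 15 items on a line, two monotonically related measures.
n_pairs, rho = second_order_agreement(
    range(15),
    lambda p, q: abs(p - q),
    lambda p, q: (p - q) ** 2,
)
```

A rank correlation near 1 would indicate that the two conditions order the
pairings in nearly the same way, which is the sense of "virtually identical"
similarity structure used above.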
In the second experiment (Heider and Oliver, 1972), a second-order
isomorphism is implied between the structure of color space as represented in memory
and the structure of color space as given in perceptual experience. They sought
to determine whether two peoples, who differ markedly [illegible] (the
Dani of New Guinea and Americans), actually structured the color space in
markedly different ways. There were two tasks: in one, subjects from the two
populations named colors; in the other, they matched colors from memory. Through
multidimensional scaling, it was determined that the structure of the color
space was remarkably similar for the two populations when derived from the
memory data but, as expected, quite dissimilar when derived from naming. The
important comparison here lies between the relational structure of colors in
memory and the relational structure of colors in perception (Shepard, 1962;
Helm, 1964). On the available evidence, perceptually the two structures
appear to be virtually the same (Heider and Oliver, 1972).

One may argue from these experiments that memory preserves or mimics the
structural relations among perceptual properties. But the notion of brain
mechanisms modeling external events suggests something more: an isomorphism
between external and internal processes. We turn, therefore, to experiments that
examine the following proposition as a corollary of the second-order isomorphism
theorem: when one is imaging an external process, one passes through an orderly
set of internal states related in a way that mimics the relations among the
successive states of the external process (cf. Shepard and Feng, 1972).

Suppose that you are shown a pair of differently oriented objects pictured
in Figure 1 and that you have to determine whether the two objects are the same
or different. I suspect that you would reach your decision by manipulating the
objects in some way, say, by rotating one of them and comparing perspectives.
But suppose that you are shown a pair of two-dimensional portrayals of the
three-dimensional objects (which, of course, is what Figure 1 is) and that your
task is to decide as quickly as possible (and obviously without the aid of
manipulation) whether one of the objects so depicted could be rotated into the
other. In an experiment of this kind (Shepard and Metzler, 1971), it was shown
that the decision latency was an increasing linear function of the angular
difference in the portrayed orientation of the two objects. At 0° difference, the
latency was 1 sec, while at 180° difference the latency was 4 or 5 sec. Each
additional degree of rotation added approximately 16 msec to the latency of
recognition. This was essentially so whether the rotation was in the plane of the
picture or in depth.
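
The linear relation just described can be written out as a small worked
example. This is a sketch using only the approximate figures quoted above
(a 1 sec latency at 0° and roughly 16 msec per additional degree); the
constant and function names are illustrative, not taken from Shepard and
Metzler.

```python
BASE_LATENCY_S = 1.0        # approximate latency at 0 degrees, as quoted above
SECONDS_PER_DEGREE = 0.016  # approximate cost of each additional degree


def predicted_latency(angle_deg):
    """Predicted decision latency (sec) for a given angular difference."""
    if not 0 <= angle_deg <= 180:
        raise ValueError("angular difference must lie between 0 and 180 degrees")
    return BASE_LATENCY_S + SECONDS_PER_DEGREE * angle_deg


latency_at_180 = predicted_latency(180)  # 1.0 + 0.016 * 180 = 3.88 sec
```

The value at 180° comes to about 3.9 sec, which agrees with the reported
latency of 4 or 5 sec given that the slope is only an approximation.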

This example illustrates the capability of neural processes to model the 
relational structure of external happenings. For further instances, the reader
is referred to Cooper and Shepard (1973) and Shepard and Feng (1972).

What Presupposes Indirect Perception?

What is behind the assumption that stimulation underdetermines perception
and that to perceive visually one must go beyond what is given in the light to
an ocular system? What is the given? History's answer is that the reference is
to the stimuli for the receptors or to the sensations provided by the senses.
Clearly, the need to suggest that a brain must guess, infer, or construct will
have its roots primarily in the attributes that tradition has adduced for stimuli
and/or sensations.
We can begin with the notion that they are given and given directly.
Construction presupposes some basic materials, and therefore all constructive
theories concede a preliminary stage whose output is not guessed or created.
And the degree to which a theory conceives perception as indirect is inversely
related to the elaborateness of the evidence deemed to be directly detectable.
The point sensations that occupied this preliminary stage in the thinking of the
structuralists have given way to the oriented lines and angles imposed upon us
by an impressive neurophysiology (Hubel and Wiesel, 1967, 1968) and the
pragmatics of building workable computer programs for pattern recognition (Minsky,
1963).

A current idea is that preliminary to object recognition is a stage that
detects primitive features. It is assumed as a matter of principle that there
is a finite number of such features, which poses the problem of how an
indefinitely large number of objects can be discriminated and recognized. In order
that infinite usage may be made of a limited number of features, subsequent
processes in perception must instantiate knowledge about relations among features
and must, perhaps, be capable of generating a variety of object representations
(Minsky, 1963). As one theorist (Gregory, 1970:36) has remarked: "Perception
must, it seems, be a matter of seeing the present with stored objects from the
past."

The assumption that what is given directly in perception is a finite set of
punctate elements necessitates a search beyond the stimulus for an explanation
of the perception of objects and of the fact that we experience optical events
as spatially unified. The assumption of stimuli as punctate demands a theory of
indirect, constructed perception.

It is curious that the circumscribed feature set, so far uncovered by
neurophysiology, has been adopted energetically and somewhat uncritically as a
departure point for much of current visual perception theory. The experimental
evidence is that there are neural units selectively sensitive to simple spatial
relations (e.g., a line at a given orientation) and to simple spatial-temporal
relations (e.g., a line moving in a given direction). Does this evidence
delimit the list of feature detectors? Clearly, there is no a priori reason for
believing that future research will not reveal detectors, or more aptly, systems,
selectively sensitive to more complex optical relations. The growing evidence
for selective sensitivity to spatial frequency (Sekuler, 1974) corroborates this
suspicion.

But whatever our misgivings, we should not lose sight of the important and
instructive fact that the above "simple" spatial and spatial-temporal relations
are characterized as being directly given. In using the term "direct" with
respect to the detection of features, we intend the following. First, the course
of detection does not involve establishing an internal representation of the
feature that is then matched by some separate process against a stored
representation. Second, and related, the pickup of these features does not depend on
information other than that currently available in the stimulation; that is to
say, their detection is not underdetermined.

Closely allied with the interpretation of stimuli as punctate elements is
the prevailing assertion that stimuli are temporally discrete. If stimuli are
momentary, then there must be routines for combining stimuli that are distributed
in time. This is exemplified in the familiar problem of how scenes are perceived
through a succession of eye movements. An aim of the fovea corresponds to the
reception of a sample of the available stimuli. Thus a succession of aims
yields a succession of retinal images and therefore a succession of different
sets of punctate elements. For the observer who moves or scans, the
fitting-together operations performed on each set of adjacent punctate elements must be
supplemented by processes that collect together and synthesize the products of
these operations over time.

Hochberg (1968, 1970) and Neisser (1967) have emphasized the distinction
between the information in a single glance and the integration of information
from a succession of glances. They argue that a panoramic impression cannot be
specified by a single fixation, for in general a single fixation provides the
observer with only local information about the three-dimensional structure of a
scene. Moreover, only in the foveal region of a fixation is the information
detailed. Thus, to perceive a scene, an observer must integrate several fixations.
Evidently panoramic perception is constructed, but how this
construction-through-integration takes place remains very much a mystery, and to date there are no
viable accounts of how it might occur.

The characterization of stimuli as punctate and momentary supplies the
backdrop for one of the more heralded reasons for claiming that perception goes
beyond what is given. The well-known phi phenomenon provides a case in point.
If two lamps are lit in alternation, with an appropriate interval elapsing between
successive lightings, one perceives a single lamp moving back and forth. The
punctate and momentary characterization of stimuli would indicate that there are
actually two stimuli separate in space and time, although perceptually the
impression is of a single stimulus moving from one point in space to another.
Thus, we appear to have a compelling reason for the notion that perception goes
beyond what is given: perceiving a stimulus where it is not.

One particularly dramatic variant of the phi phenomenon, which would seem
to dictate rather than invite the constructive interpretation, is provided by 
Kolers (1964). When a Necker cube is set into oscillating apparent motion, an 
observer under the appropriate conditions will see the cube as rotating in "mid- 
flight." In similar circumstances, one can also see circles elastically trans- 
formed into squares and upward pointing arrows rigidly transformed into downward 
pointing arrows (Kolers and Pomerantz, 1971). We see a changing form where
there is no form changing. Moreover, the particular transformation experienced 
befits the forms involved. Based on these observations, seeing mirrors thinking. 

If perception is indirect, we may ask: What do perception by eye and
perception by hand have in common? We might answer "very little" on the assumption
that an eye delivers to a brain sensations or features that are radically
different from those that a hand delivers. So how is it that I can often perceive
by eye that which I can also perceive by touch? Traditionally, the answer to
this question has been sought in a mediating link, frequently association, which
relates visual impressions to tactile impressions, and vice versa. In this view,
the data of one sense are rationalized (given meaning) by the data of another.

Where we conceptualize the senses as yielding distinctively different data,
we are forced into assuming that cross-modality correlation is an essential
feature of perception. This is especially so where we believe that the data of one
sense are intrinsically less meaningful than the data of another. Since the
time of Bishop Berkeley, vision has been construed, repeatedly, as parasitic on
touch and muscle kinesthesis.



Let us now turn to pictures, for they would appear to provide prima facie
evidence that perception must go beyond what is given in the stimulation.
Pictures are flat projections of three-dimensional configurations and they can be
visualized in either way. Further, it is quite obvious that we can "see" the
three-dimensional structure represented by a picture even when the information is
peculiarly impoverished, as witnessed by outline drawings. We can also "see"
the appropriate three-dimensional arrangement, even though the picture is
ambiguous in the sense that the same projected form could arise from an infinite
variety of shapes. In Goodman's (1968) view, pictorial structure is an arbitrary
conventional language that must be learned in a way that corresponds to how we
learn to read. For Hochberg (1968) the resolution of the problem of picture
processing is sought in the schematic maps stored in one's memory banks and
evoked by features. Similarly, Gregory (1970) has looked to object-hypotheses
as the tools by which we disambiguate a two-dimensional portrayal of a
three-dimensional scene. Though none of these authors provides anything like an
account of how "other knowledge" is brought to bear on picture processing, the
task has been taken up in earnest by workers in artificial intelligence. It is
worth noting that there are students of perception who, contrary to the above
points of view, suspect that the perception of pictures need not be indirect
(Gibson, 1971; Hagen, 1974).

Finally, there is the idea that the environment is broadcast to the observer
as a set of visual cues or clues that are intrinsically meaningless. Here the
claim is that the relation between clues and assigned meaning is homomorphic:
one clue may be assigned many meanings, many clues may be assigned the same
meaning. The perceiver, therefore, is assumed to possess a memory-based code that
rationalizes the complex of clues read off the retinal image. In sum, where the
stimuli are considered as only clues to environmental facts, there is a need for
positing mediating constructive activity as the means by which these facts are
determined.

Information Processing: A Methodology for Constructivism

Closely cognate with the constructivist philosophy is an approach to
problems of perception that is currently in vogue and that may be loosely referred
to as "information processing" (Haber, 1969). Implicitly, it takes as its
departure point the assumptions noted above. More explicitly, it defines
perception not as immediate but as a hierarchically organized temporal sequence of
events involving stages of storage and transformation. Transformations occur at
points in the information flow where storage capacity constraints demand a
recoding of the information. Such recoding must exploit long-term memory structures,
or internal models of the world, and, in keeping with a fundamental
constructivist belief, perceiving cannot be divorced from memorial processes. Guided by
these assumptions, information processing seeks methods that will differentiate
the flow of visual information in the nervous system; that is, methods that will
decompose the information flow into discrete and temporally ordered stages. In
the main, backward masking (e.g., Sperling, 1963; Turvey, 1973), delayed
partial-sampling (Sperling, 1960), and reaction-time procedures (e.g., Posner and
Mitchell, 1967; Sternberg, 1969) singly or in combination have provided the
requisite tools.

An example will provide an elementary illustration of the
information-processing approach: the simple task introduced by Sternberg (1966, 1969). A display
of one to four characters is presented briefly to an observer who must press a
key to indicate whether a subsequently presented single character was or was not
a member of the previously displayed set. In general, as the number of items in
the memory set increases from one to four, the latency of the observer's response
to the probe increases linearly. The linear plot of latency against number of
items has two characteristics: slope and intercept. We might assume that the
slope of the function identifies the process of comparing the probe character
with the representations of the characters in memory. But what of the intercept?
Does it signify a perceptual/memorial operation or simply the time taken to
organize the response?

Suppose that we now degrade the probe character in some way. Does the
degrading affect the memory comparison or some other process? If it affects memory
comparison, then we should expect the slope to alter, assuming the validity of
our original interpretation. An experiment of this kind reveals that degrading
the probe essentially leaves the slope of the function invariant but does
raise the intercept (Sternberg, 1967). We can now argue that the intercept
reflects, at least in part, processes of normalizing and perceiving the probe that
precede memory comparison. In short, in the performance of this simple task we
can identify two independently manipulable and successive stages. Witness to
the potential usefulness of this distinction is the observation that with words
as both memory items and probes, poor readers differ from good readers only in
the height of the intercept (Katz and Wicklund, 1972).
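
The slope-and-intercept reasoning can be made concrete with a small sketch.
The latencies below are hypothetical numbers invented for illustration (they
are not Sternberg's data), chosen so that degrading the probe shifts the whole
line upward without changing its slope; the function name `fit_line` is
likewise an assumption of this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


set_sizes = [1, 2, 3, 4]

# Hypothetical mean latencies in msec (assumed for illustration only).
intact_probe = [440, 480, 520, 560]
degraded_probe = [510, 550, 590, 630]

slope_intact, intercept_intact = fit_line(set_sizes, intact_probe)
slope_degraded, intercept_degraded = fit_line(set_sizes, degraded_probe)
```

With these made-up values both conditions give a slope of 40 msec per item,
while the intercept rises from 400 to 470 msec. On the interpretation given
above, an unchanged slope locates the effect of degradation outside the memory
comparison stage, and the raised intercept locates it in a stage that precedes
comparison.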

Though it is true that information processing as an approach often provides
an elegant framework and set of procedures for examining perceptual processes,
it is also the case that the descriptions it yields are for the most part crude
and approximate. It is not unjust to say that the information-processing
methodology is limited to a broad identification of stages and is inherently
insufficiently powerful to supply sophisticated descriptions of perceptual procedures
and their complex interrelationships. For the achievement of a more rigorous
account of the how of perception, constructivism may have to look elsewhere.

Scene Analysis by Machine: Formalized Constructivism

It is by now evident to the reader that constructivism conceives perception
as an act involving a potentially large variety of knowledge structures. To
gain a purchase on the form of such structures, to discover effective
representations (Clowes, 1971) and how they could relate, is in part the task of research
and theory in artificial intelligence. It will prove instructive for our
purposes to look at systems sufficiently intelligent to infer the three-dimensional
structure of objects from two-dimensional line portrayals of opaque polyhedra of
the type depicted in Figure 2.

The early work in pattern recognition by machine was dominated by models
that held that patterns could be classified by a procedure listing feature values
and then mapping these values onto categories through statistical decision
processes (e.g., Selfridge and Neisser, 1960). Contemporary work follows the
principle that pattern classification systems must possess the ability to articulate
patterns into fragments and to specify relations among the articulated fragments
(Minsky, 1963). Consequently, artificial intelligence research has been
attracted to the structural models that have proved successful in linguistics,
and the focus of the enterprise has switched from the problem of pattern
recognition to the problem of pattern description. The search is for structural
grammars that describe the relationships among the parts of a pattern.
As a preliminary to our discussion, let us conjecture on the types of stages
that might intervene or mediate between a picture and the resulting
three-dimensional description. First, we can assume that a picture is projected onto the
retina as an array of points that can be partitioned into sets as a function of
brightness. The three-dimensional properties of a picture cannot be inferred
from this initial points representation. The endeavor, then, of early visual
operations must be that of recovering from these patches of brightness the
lines that make up the picture and, from these lines, the regions into which the
picture may be parsed. (For the purposes of computation, a region may be defined
as a set of points with the property that a path drawn between any two elements
of the set does not cross a line.) We can usefully refer to these three
representations (points, lines, and regions) as being in the picture domain, for as
yet they are indifferent to the three-dimensional structure the picture
represents. But what we want is a three-dimensional account, a description of a
scene. Having recovered regions, the next constructive operation is to map the
regions description onto a surfaces description and this in turn onto a
description of the bodies to which the surfaces belong. To facilitate the discussion
that follows, we will refer to these two representations (surfaces and bodies)
as being in the scene domain and recognize that most of artificial intelligence
research on intelligent picture processing has been concerned with the mapping
between the picture and scene domains. The final representation and domain, the
objects domain, is obtained by identifying the objects the bodies represent.
This task probably requires knowledge of a higher order such as permissible
real-world relations among surfaces and the functional capabilities of variously
described bodies (for example, cups are for containing liquids but they could also
be used as effective paper weights).
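
The computational definition of a region given in the parenthesis above lends
itself to a brief sketch. What follows is one common way of realizing such a
definition, a flood fill over a grid, and is not drawn from any of the programs
discussed in this paper. The picture encoding ('#' for a cell on a line, '.'
for a blank cell), the restriction of paths to horizontal and vertical steps,
and the function name `label_regions` are all assumptions made for
illustration.

```python
from collections import deque


def label_regions(picture):
    """Label the regions of a line drawing held as a list of strings.

    Two blank cells belong to the same region when a path of horizontal and
    vertical steps joins them without crossing a '#' (line) cell.
    Returns (labels, count): labels[r][c] is a region id from 1 to count for
    blank cells and 0 for line cells.
    """
    rows, cols = len(picture), len(picture[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if picture[r][c] == "#" or labels[r][c]:
                continue
            # A blank cell not yet labeled starts a new region.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and picture[ny][nx] != "#" and not labels[ny][nx]:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# A toy picture (assumed): two crossing lines divide the array into four regions.
picture = [
    "..#..",
    "..#..",
    "#####",
    "..#..",
]
labels, count = label_regions(picture)
```

Note that the result is still wholly in the picture domain: the labels say
which points belong together in the plane and nothing about surfaces or bodies.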

Thus, arriving at a three-dimensional description of a picture may be
interpreted as the construction of a number of representations in an orderly fashion
from less to more abstract. A representation is said to consist of a set of
entities together with a specification of the properties of those entities and
the relationships existing among them (Sutherland, 1973). We can now provide an
elementary description of computer methods for analyzing complex configurations
of objects presented pictorially. The computer is programmed to begin by
detecting local entities or features of a given picture. Then it searches for
relationships among entities that indicate the presence of particular subpatterns.
The determination of specific relationships of subpatterns allows for the
detection of more global patterns, i.e., for the establishment of a higher-order
representation, and so the procedure continues until a satisfactory structural
description of the scene has been obtained.

Let us examine in an approximate way a few representative programs for
scene analysis. We are indebted to Sutherland's (1973) lucid clarification of
these complex programs.

We begin with Guzman's (1969) well-known program partly because it is
relatively simple and partly because it provides the jumping-off point for many
subsequent programs. Guzman's program takes an input that has already been
parsed into regions and seeks to discover from this description how many
separate bodies are present in the two-dimensional portrayal. The goal of the
program is quite modest: it aims only at specifying which groups of regions are
the faces of a single body, a preliminary step to arriving at a
three-dimensional account. The inferences are based on the properties and
implications of vertices of the kind shown in Figure 3, with each vertex being
classified on the basis of how many lines meet at the vertex and their
respective orientations. Essentially, the program, having classified the
vertices into types, establishes links between regions that meet at vertices.
Consider the arrow type of vertex. This would normally be caused by an exterior
corner of an object where two of its plane surfaces form an edge. Consequently,
the two regions that meet on the shaft of an arrow (that is, those that are
bounded by the two smaller angles) are linked; those regions that meet these
regions at the barbs of the arrow, however, are not. Similarly, a vertex of the
fork type depicts three faces of an object, so that links may be inserted across
each line. By linking the regions as a function of where they meet, it is
possible to separate the bodies represented in the figure. The final stage of
the program consists of grouping together regions connected directly by a link
or indirectly by a chain of links with other regions.

Where a region is connected by only a single link to the members of a
connected collection of regions, it is not identified as a member of that
collection. Additionally, the program incorporates a policy of examining in a
limited fashion neighboring vertices in order to determine whether links should
indeed be made at a given vertex.
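
The linking and grouping stages just described can be sketched in a few lines
of Python. This is only an illustration, not Guzman's code: the function name
`group_regions`, the representation of links as pairs of region labels, and the
threshold of two links are assumptions made for the example, and the limited
look-ahead at neighboring vertices is left out.

```python
from itertools import combinations


def group_regions(regions, links, min_links=2):
    """Merge groups of regions that are joined by at least `min_links` links.

    A group attached to another by a single link is left separate, in the
    spirit of the single-link rule described in the text.
    """
    groups = [frozenset([r]) for r in regions]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(groups, 2):
            # Count the links that have one end in each group.
            n = sum(1 for p, q in links
                    if (p in a and q in b) or (p in b and q in a))
            if n >= min_links:
                groups = [g for g in groups if g not in (a, b)] + [a | b]
                merged = True
                break
    return sorted(sorted(g) for g in groups)


# Hypothetical scene: faces A, B, C of one block, each pair linked twice
# (once at a fork, once at an arrow); D touches C through one stray link.
links = [("A", "B"), ("B", "C"), ("A", "C")] * 2 + [("C", "D")]
print(group_regions("ABCD", links))  # [['A', 'B', 'C'], ['D']]
```

The three mutually linked faces end up as one body, while the region attached
by a lone link is kept apart.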

It is evident that the more closely and formally we examine the question of
how a picture is decomposed into separate bodies, the more sensitive we become
to the complexity of the problem. Guzman's (1969) program, though primitive,
hints at the type of heuristics that may have to be employed by a human observer
in arriving at a description of a picture. Of special importance is the
suggestion that the figure-ground or segmentation problem, which is often
discussed airily in a single sentence of a text on visual perception, is not
likely to be solved by the nervous system in any simplistic manner. In this
context one is reminded of Hebb's (1949) distinction between primitive and
nonsensory unity. Segmentation in the former case could be done on the basis of
brightness differentials; in the latter it could only be achieved, so Hebb
argues, through an application of knowledge of patterns and objects.

Guzman's program is not especially successful. In a significant number
of instances it fails to achieve a correct decomposition of a scene. It is
obvious the program is insufficiently knowledgeable. What is needed is
knowledge of what is permissible and what is not in the structuring of
real-world objects.

To map from regions to surfaces asks that we describe lines in the picture
domain as entities in the scene domain. In the general case, a picture line can
represent a number of different scene entities, such as an edge or the boundary
of a shadow. How may object edges be recovered from line drawings? It turns out
that vertices provide a useful basis for the edge interpretation of picture
lines: regions that meet at a line will correspond to surfaces in the scene
domain that meet at a convex or concave edge; certain arrangements of adjacent
regions will correspond to the occlusion of one surface by another. Fortunately,
with respect to edges, each vertex type identifies a limited set of
three-dimensional interpretations. Furthermore, by mapping vertices in the
picture domain onto corners in the scene domain it becomes possible to eliminate
ambiguous edge interpretations: the criterion is whether the set of edge
interpretations for a set of vertices around a region is consistent (Clowes,
1971).

Limiting his problem to that of pictures with no more than three lines
meeting at a point, Clowes (1971) draws up an exhaustive classification of
corners into four types. The classification is based on the concave/convex
relationship between surfaces and the visible/invisible properties of surfaces.
A Type I corner is defined as one in which the three edges meeting at the corner
are all convex; a Type II is one in which two of the edges are convex and one
concave; a Type III corner has two concave edges and one convex edge; and a Type
IV corner is one in which three concave edges meet. The corners are further
subcategorized according to the number of visible surfaces.

How may we construct a representation of a picture in terms of surfaces
(three-dimensional entities) and their relationships? In part the solution lies
in defining the mapping of vertex type onto visible corner type. First, the
Clowes program recovers regions and, associated with each region, the lines and
vertices bounding the region. Then the vertices are classified and for each
vertex recovered the program lists the possible corner interpretations. The
problem now is to arrive at an unequivocal description of corners and hence of
the manner in which the surfaces present in the picture articulate. To do this
the program asks: Which sets of corner interpretations are compatible with the
structure of three-dimensional bodies? Obviously, the program must instantiate
knowledge about the physical structure of polyhedra in order that it might
answer this question and reach a satisfactory three-dimensional description of
the scene. It turns out that the following single fact proves sufficient to
eliminate impossible combinations of corner interpretations: where two surfaces
meet at an edge, that edge must be invariant (either convex or concave, or
implicating one surface behind another) throughout its length. Thus, given two
vertices connected by a line, one could not interpret one vertex as being of
Type I and then interpret the other as being of Type IV, for this would mean
that the edge corresponding to the connecting line changes from convex to
concave, which is contrary to the structure of possible polyhedra. It is
important to recognize that where this injunction imposes constraints on the
interpretation of a pair of connected vertices it automatically restricts the
corner interpretations of neighboring vertices. Through iteration it will lead
to the discovery of sets of compatible corner interpretations.
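
The iterative pruning described above can be illustrated with a small,
generic consistency filter. This is a sketch under stated assumptions and not
a reconstruction of Clowes's procedure: the data layout (each vertex maps to a
list of candidate interpretations, each of which assigns a label to every line
at that vertex), the labels, and the name `filter_interpretations` are all
hypothetical.

```python
def filter_interpretations(candidates):
    """Discard interpretations that give a line a label its other end forbids.

    `candidates` maps each vertex to a list of dicts {line: label}. A line
    must carry one label along its whole length, so an interpretation
    survives only if every vertex sharing one of its lines still has some
    interpretation with the same label for that line.
    """
    # Record which lines touch each vertex before anything is pruned.
    lines_at = {v: set().union(*(i.keys() for i in interps))
                for v, interps in candidates.items()}
    current = {v: list(interps) for v, interps in candidates.items()}

    def supported(v, line, label):
        return all(any(i[line] == label for i in current[u])
                   for u in current if u != v and line in lines_at[u])

    changed = True
    while changed:  # repeat until a full pass removes nothing
        changed = False
        for v in current:
            kept = [i for i in current[v]
                    if all(supported(v, ln, lb) for ln, lb in i.items())]
            if len(kept) != len(current[v]):
                current[v] = kept
                changed = True
    return current


# Hypothetical drawing: line "pq" joins P and Q, line "qr" joins Q and R.
candidates = {
    "P": [{"pq": "convex"}],
    "Q": [{"pq": "convex", "qr": "convex"},
          {"pq": "concave", "qr": "concave"}],
    "R": [{"qr": "convex"}, {"qr": "concave"}],
}
print(filter_interpretations(candidates))
```

Fixing the edge at P removes one reading of Q, and that in turn removes one
reading of R, which is the knock-on restriction of neighboring vertices noted
in the text.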

Thus, the Clowes program exceeds the proficiency of Guzman's by constructing
a three-dimensional description of the bodies present in a scene. It
successfully differentiates holes from bodies and rejects impossible polyhedral
structures. Its limitations include that it accepts only pictures with three or
fewer lines converging on a point, and it accepts these pictures only when the
lines have been specified beforehand. We may say of Clowes's program that it
starts upstream, avoiding the problem of recovering lines from points.

Although this may not appear to be overly serious, consider that in
naturally illuminated scenes it is often the case that faces of different bodies
may reflect the light equally, which means that some edges will not yield a
brightness differential or, if they do, it is below some working threshold. How
might such edges be recovered? How are lines missing from the picture domain
detected?

Where programs have been written to take as their starting point the actual 
output from a television camera (i.e., an input of points), it has proved fruit- 
ful, even necessary, to introduce greater flexibility in the relations among rep- 
resentations and to extend the knowledge of the program to include descriptions 
(prototypes) of possible objects. This higher-order knowledge can be used to
guide the construction of lower-level representations. 

A program that makes modest use of these principles is that of Falk (1972).
Here directly recoverable lines are assigned to different bodies by a variant of
Guzman's program. Special heuristics beyond those specified by Guzman are
needed, however, to achieve a segmentation into bodies given that certain lines
are just not present in the input. At all events, having obtained a rough
decomposition of the scene into bodies, the program then applies heuristics to
fill in the missing lines. It is important to recognize that these low-level
operations are meant only to provide a sketch, a working hypothesis, of what a
representation of the bodies in the scene might look like.

But in order that we might better understand Falk's program, let us examine
more closely the kinds of scenes the program is dealing with and some of the
program's esoteric capabilities. There are nine permissible objects (polyhedra,
as before) of fixed size that can be variously arranged on a table top such that
some objects may be resting on the faces of others. The program possesses highly
specialized knowledge about the three-dimensional coordinates of the television
camera relative to the table surface, and about the exact size and shape of each
of the nine objects. The program will use its exact knowledge of the dimensions
of the permissible objects in order to recognize them, but to apply this
knowledge it needs to be able to compute the lengths of at least some of the
edges of a given body. Since it knows the position of the camera relative to the
table top, it can infer the exact position of any point on the table in 3-space,
and in consequence, the lengths of those edges that contact the table can be
determined.

The outcome is a determination of which base edges of which bodies contact
the table, and the program then calculates the lengths of the base edges of each
body and the angles between them. These data are then subjected to operations
that seek to match the bodies corresponding to the calculated base lengths and
angles to the program's knowledge of permissible objects. Recognizing a body as
a particular object carries the bonus of knowing fully the exact dimensions of
the body, and that includes, of course, the faces that might be invisible in
the picture. By recognizing which objects are in contact with the table, it is
now possible to reassess the earlier conclusions about which bodies were
supporting which bodies.

In the final stages of Falk's (1972) program, once supported bodies are 
isolated from those resting on the table, it maps the supported bodies into the
object domain and is then in a position to provide a description of the entire
scene in terms of objects and their relationships. But recall that the lines
representation from which all else developed was incomplete and the bodies rep- 
resentation was constructed with a strong element of guesswork. To ensure that 
the description in the objects domain is accurate, a lines representation is 
synthesized from the objects representation and mapped onto the original lines 
representation. Where a significant mismatch occurs, the bodies and the objects 
representations may be revised guided by the now available knowledge about the 
global properties of the scene. 
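
The check by synthesis can be sketched as follows. The sketch is not Falk's
procedure: the set-of-labels encoding of lines, the mismatch measure, the
tolerance value, and the names `mismatch` and `best_hypothesis` are all
assumptions chosen for illustration. Because the input is known to be
incomplete, the measure penalizes only observed lines that the synthesized
picture fails to predict.

```python
def mismatch(observed, synthesized):
    """Fraction of observed lines that the synthesized picture fails to predict."""
    if not observed:
        return 0.0
    return len(observed - synthesized) / len(observed)


def best_hypothesis(observed, hypotheses, tolerance=0.25):
    """Pick the object hypothesis whose synthesized lines best fit the input.

    Returns (name, score), or (None, score) when even the best fit exceeds
    `tolerance`, signalling that bodies and objects should be revised.
    """
    name, lines = min(hypotheses.items(),
                      key=lambda item: mismatch(observed, item[1]))
    score = mismatch(observed, lines)
    return (name if score <= tolerance else None), score


# Hypothetical line labels; e4 exists in the scene but was never detected.
observed = {"e1", "e2", "e3", "e5"}
hypotheses = {
    "wedge": {"e1", "e2", "e6"},
    "cube": {"e1", "e2", "e3", "e4", "e5"},
}
print(best_hypothesis(observed, hypotheses))  # ('cube', 0.0)
```

A result of `None` corresponds to the significant mismatch in the text, the
case in which the more abstract representations would be revised.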

The latter provides the significant feature of the Falk program, namely,
the introduction of a more flexible processing routine, which permits less
abstract descriptions to be reexamined and the information present there to be
reinterpreted in the light of more abstract descriptions; thus the objects
representation is used to reevaluate the lines representation. This mode of
organization is becoming more of a necessity in writing computer programs to
perform successful scene analysis (Minsky and Papert, 1972). We have seen that
the program relies heavily both on exact knowledge of a limited number of
objects and on the fact that no other objects than these will be pictured. But
we should not be too seriously put off by this austerity and rather should view
Falk's program as an illustration of how prototypes might be exploited in visual
perception. This approach is represented more elaborately in the program of
Roberts (1965), which has knowledge both of prototypes and of their lawful
perspectives. Roberts argues that the computer program should assume, like the
human observer, that a picture is a perspective of a scene obeying the laws of
projective geometry. In addition, the Roberts program uses only three prototypes
(a cube, a wedge, and a hexagonal prism) and is able to treat a large array of
complex polyhedral objects as combinations of these prototypes. The idea that
the structure of human memory for things seen may be described as prototypes
together with transformation rules has received some measure of experimental
support (Franks and Bransford, 1971).

If I have dealt at some length with these examples of seeing machines, it has
been for several reasons. In the first place, they foster an appreciation of the
complexity of formalizing visual processes within the constructivist framework
and yet at the same time suggest the directions this formalizing might take. In
the second place, they rarely find their way into the psychological literature
on visual perception and that, in my opinion, is a serious oversight from the
point of view of constructive theory. Lastly, they permit the illustration of
certain principles that may have significant implications for the general theory
of vision.

Thus from the perspective of a seeing machine, the stuff with which a brain
works is descriptions, and not images, of optical events. Efforts to explain how
vision works ought to focus on computational or symbolic mechanisms rather than
the physical mechanisms that have been the mainstay of traditional theories. The
latter, as Minsky and Papert (1972) point out, are inherently incapable of
accounting for the influence of other knowledge and ideas on perception.

The particular form of computational or symbolic mechanism currently being
advanced is not hierarchical. It cannot be said to consist of parts, subparts,
and sub-subparts that stand in fixed relation to each other. While a
hierarchical label may be appropriate for a physical system, it is less so for a
computational system. The latter often exploits the method of two different
procedures using each other as subprocedures, and thus what is "higher" at one
time is "lower" at another. To capture the essence of nonhierarchical, highly
flexible mechanisms, the terms "heterarchical" (McCulloch, 1945; Minsky and
Papert, 1972) and "coalitional" (Shaw, 1971; Shaw and McIntyre, 1974) have been
suggested.

Though the formalization of coalitions is still very much in its infancy,
we can identify in a rough and approximate way some fundamental features that
distinguish the coalition (heterarchy) from the more familiar hierarchy. First,
many structures would function cooperatively in the determining of perception,
although not all structures need participate in all determinations. Second,
while it is certainly the case that a coalitional system has very definite and
nonarbitrary structures, the partitioning of these structures into agents and
instruments and the specification of relations among them is arbitrary. In
short, any inventory of basic constituent elements and relations is equivocal.

Perhaps the main emphasis of the coalitional formulation is the flexibility
of relations among structures. Falk's (1972) program is a very modest
instantiation of these coalitional features; Winograd's (1972) celebrated
language understanding system is a more ambitious one. Contrary to much of
current theoretical linguistics, Winograd's system is concerned more with the
problems of representing the meanings conveyed by discourse than with the
grammatical structure of discourse. The system is predicated on the coalitional
thesis that sentence comprehension necessitates an intimate and flexible
confluence among grammar, semantics, and reasoning. In the Winograd system the
sentence "parser" can search out semantic programs to determine if a particular
phrase makes sense; semantic programs can exploit deductive programs to
determine whether the proposed phrase is sensible in the context of the current
state of the real world (Minsky and Papert, 1972; Winograd, 1972). The
fundamental principle of operation, though complex, may be stated simply: each
piece of knowledge can be a procedure and thus it can call on any other piece of
knowledge.
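
A toy Python sketch may make the principle concrete. It has nothing of the
scale of Winograd's system and borrows none of its code: the world state, the
phrase pattern, and every function name below are hypothetical. The point is
only that each kind of knowledge is a procedure, and that the direction of
calling is not fixed: the parser consults semantics and deduction, and a piece
of reasoning about an action can in turn call the parser.

```python
# Toy world state; every name in this sketch is hypothetical.
WORLD = {("block", "red"), ("block", "green"), ("pyramid", "green")}


def exists_in_world(adjective, noun):
    """Deductive knowledge: is there such an object right now?"""
    return (noun, adjective) in WORLD


def makes_sense(adjective, noun):
    """Semantic knowledge: a phrase is sensible if it can refer to something."""
    return exists_in_world(adjective, noun)


def parse_noun_phrase(words):
    """Syntactic knowledge: accept 'the ADJ NOUN', consulting semantics."""
    if len(words) == 3 and words[0] == "the" and makes_sense(words[1], words[2]):
        return {"adjective": words[1], "noun": words[2]}
    return None


def can_pick_up(phrase):
    """Reasoning about an action, which here calls the parser in turn."""
    return parse_noun_phrase(phrase.split()) is not None


print(parse_noun_phrase("the green pyramid".split()))
print(can_pick_up("the red pyramid"))  # False: no such object in WORLD
```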

Perception at a Glance: The Contribution of the Information-Processing Approach 

The preceding discussion has focused on the theory of seeing as a
constructive act. In an elementary but sufficient manner, we have considered
some of the principal notions, the scaffolding if you like, on which theories of
indirect perception are erected. Next let us examine experimental efforts to
unravel the processes by which information to an eye is "transformed, reduced,
elaborated, stored, recovered and used" (Neisser, 1967:4).

Our initial focus is the analysis of perception at a glance. Much of in- 
formation-processing research has revolved around tachistoscopic presentations 
and an interpretation of their perceptual consequences. Moreover, the materials 
briefly exposed have been for the most part letters, numbers, and words, so that 
we might take the liberty of describing the analysis as that of "the stages un- 
derlying the perception of linguistic material in a single fixation." The dom- 
inant use of linguistic material inhibits the elaboration of tachistoscopic per- 
ception into a general theory of visual-perception-at-a-glance, but given the 
language-processing interests of this conference that limitation is perhaps of 
no great consequence. Let us remind ourselves that on the constructivist view 
an understanding of perception in a single fixation is fundamental since normal
everyday perception is construed as the fitting together of successive retinal 
snapshots. 

Iconic and Short-Term Schematic Representations 

There is one thing we can assume from the outset: if perception is a
process over time, then in the cases of brief but perceivable optical events
(say of the order of several milliseconds, to provide an extreme instance),
there ought to be a mechanism that internally preserves such events beyond their
physical duration. It is on this internally persisting representation that
constructive operations are performed and in its absence we ought to suppose
that the perception of brief displays would be well nigh impossible.
Furthermore, we can suggest the following about this representation: it should
be in an uncategorized form. Suppose that the mechanisms of perception were
prefabricated in the sense of possessing knowledge about, and routines for,
abstracting universal regularities, and that these regularities were
automatically registered and classified. One consequence of this arrangement
would be category blindness (MacKay, 1967). If a new category attained
significance for the organism, the organism would not be able to grasp it.
Obviously, the perceptual mechanisms, although partly prefabricated, must also
be flexible. What we can imagine is that on the occurrence of an optical event
prefabricated general procedures are brought to bear, and a variety of testings
are carried out to determine how this event might fit into the current
organization. This suggests (see MacKay, 1967) that there ought to be a region
or "workshop" in the perceptual system that allows for hypothesis testing before
categorization. We might anticipate, from the constructivist position, the
existence of a transient memory for visual stimulation that preserves the raw
data in a relatively literal form.

In many respects the pivotal concept in the information-processing account
of perception-at-a-glance is a memory system having the properties suggested
above. The original evidence for a transient, high-capacity, literal visual
memory comes from the well-known experiments of Sperling (1960) and Averbach and
Coriell (1961).

Brief visual storage, or iconic memory as Neisser (1967) has termed it, was
isolated through the use of a delayed partial-sampling procedure. Essentially,
this procedure involves presenting simultaneously an overload of items
tachistoscopically, followed by an indicator designating which element or subset
of elements the subject has to report. If the indicator is given soon after the
display, the subject can report proportionately more with partial report than if
asked for a report of the whole display. This superiority permits the inference
of a large-capacity store; the sharp decline in partial-report superiority with
indicator delay permits the inference of rapid decay. Purest estimates of the
decay rate reveal that the duration of this storage is of the order of 250 msec
(Averbach and Coriell, 1961; Vanthoor and Eijkman, 1973).
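
The inference from partial report to capacity rests on simple arithmetic,
sketched here in Python. Because the subject cannot know in advance which
subset will be cued, the proportion reported from the cued subset is taken to
estimate the proportion available across the whole display. The function name
`estimated_available` and all the numbers are hypothetical and are not data
from the studies cited.

```python
def estimated_available(cued_correct, cued_size, display_size):
    """Scale the proportion reported from the cued subset to the whole display."""
    return cued_correct / cued_size * display_size


# Hypothetical numbers, for illustration only: a 12-item display in 3 rows of 4.
whole_report = 4.5                          # items named when no cue is given
partial = estimated_available(3, 4, 12)     # 3 of the 4 cued items named
print(partial, partial > whole_report)      # 9.0 True
```

On these made-up figures the estimate from partial report is twice the whole
report, which is the kind of superiority the text describes.
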

The proof of the precategorical character of iconic storage is found in the
kinds of selection criteria that yield efficient performance in the delayed
partial-sampling task. Generally, superior partial report can be demonstrated
when the items in a display are selected for partial report on the basis of
brightness (von Wright, 1968), size (von Wright, 1968, 1970), color (von Wright,
1968, 1970; Clark, 1969; Turvey, 1972), shape (Turvey and Kravetz, 1970),
movement (Treisman, Russell, and Green, 1974), and location (e.g., Sperling,
1960). Partial-report performance, however, is notably poorer (i.e., not
significantly better than whole report) when the letter/digit distinction is the
basis for selection (Sperling, 1960; von Wright, 1970). On these data we can
conclude that one can select or ignore items in iconic storage on the basis of
their general physical characteristics, but one cannot with the same efficiency
select or ignore items on the basis of their derived properties. We might wish
to argue, therefore, that the iconic representation is literal.

We can gain a richer understanding of the character of iconic storage by
comparing it with another form of visual memory that arises quite early in the
flow of information and that may be described as abstract or schematic. As we
shall see, the longevity of this representation exceeds by a considerable degree
that of the icon.

It has often been argued that the iconic representation of linguistic
material undergoes a metamorphosis from a visual to a linguistically related
form. One elegant expression of this view suggests that the raw visual data are
cast rapidly into a set of instructions for the speech articulators for
subsequent (and more leisurely) rehearsal and report (Sperling, 1967). However,
as one would intuit, the transformation of the icon establishes not only a
representation in the language system but, in addition, and we may suppose in
parallel (Coltheart, 1972), a further and more stable representation in the
visual system. The first strong experimental hints that this might be so were
provided by Posner and Keele (1967) in an experiment that stood on the shoulders
of an earlier series of now celebrated experiments by Posner and Mitchell
(1967).

Consider a situation in which subjects are presented a pair of letters and
asked to respond "same," if these letters have the same name, and "otherwise,"
if they are different. On some occasions the letters with the same name are
also physically the same (e.g., AA) and on others the letters with the same name
are physically dissimilar (e.g., Aa). It proves to be the case that "same"
responses to AA are significantly faster than "same" responses to Aa, which
suggests that AA-type pairs are not necessarily being processed on the basis of
name but rather that their visual characteristics are being used to make the
response. Adopting Posner's (1969) terminology, we will refer to matches of
physically identical letters as "physical matches" and matches of physically
different but nominally identical letters as "name matches."

We now look at what happens when the two members of a pair are presented
sequentially rather than simultaneously and the time elapsing between the
appearance of the first and the appearance of the second is varied. Under these
conditions, the latency of a name match is indifferent to the delay time, but
the latency of a physical match increases with delay until it and the name-match
latency are virtually identical. The converging of physical- and name-match
latencies identifies a decline in the availability of a visual code and an
increasing dependence on the name of the letter with the passage of time. In the
original experiment (Posner and Keele, 1967), the estimated duration of the
visual code isolated by the reaction-time procedure was of the order of 2 sec;
however, subsequent research has shown that it is considerably more durable than
the original experiment would have us believe (e.g., Kroll, Parks, Parkinson,
Bieber, and Johnson, 1970; Phillips and Baddeley, 1971). At all events, the
superior longevity of this visual code to that exposed by the delayed
partial-sampling procedure (namely, the icon) suggests that we are indeed
dealing with two different visual, memorial representations (two different
descriptions) of an optical event. But before elaborating on this conclusion, we
ought to be more explicit about the difference between the two procedures
defining the two representations. In one, delayed partial sampling, we are
interested in the persistence of aspects of visual stimulation not yet
selectively attended to; in the other, our concern is with the persistence of
the visual description of stimulation that has enjoyed the privileges of
selective attention.

Beyond the difference in persistence between the two visual representations
we can note the following differences. In the first place, the iconic
representation is perturbed by an aftercoming visual mask; the schematic
representation is not (cf. Sperling, 1963; Posner, 1969; Scharf and Lefton,
1970; Phillips, 1974). In the second place, the capacity of the iconic
representation is probably unlimited, but that of the schematic representation
is clearly constrained (Coltheart, 1972; Phillips, 1974). In the third place, if
during the existence of a representation demands are made concurrently on
processing capacity (Posner, 1966; Moray, 1967), the persistence of the iconic
representation seems to be unimpaired (Doost and Turvey, 1971), in contrast to
the persistence of the schematic representation, which can be shown to be
severely reduced (Posner, 1969). The corollary to this latter distinction,
however, is that the close dependency of the schematic representation on central
processing capacity means that the persistence of this representation is
indeterminate: it may persist for as long as sufficient processing capacity is
devoted to it (Posner, 1969; Posner, Boies, Eichelman, and Taylor, 1969; Kroll
et al., 1970).

One last difference is worth attention. The high-capacity, maskable iconic
representation is tied to spatial position; the low-capacity, unmaskable
schematic representation is not (Phillips, 1974). On the basis of the schematic
representation, the perceptual system can match two successive optical events
when they are spatially separate as efficiently as it can when they are
spatially congruent. On the basis of the iconic representation, however, a match
of successive and spatially separate events is conducted less efficiently than a
match of successive events that spatially overlap (Phillips, 1974).

We could go farther in our discussion of these two largely different
descriptions of the visual appearance of a stimulus but what has been said is
sufficient for our purposes. Let us therefore conclude this discussion with a
crude sketch of the role played by the iconic description in the general scheme
of visual information processing.

For current theory, memory consists of two basic structures, termed "active"
and "passive," or alternatively two basic models, the "short-term" and the
"long-term." The relation of the iconic representation to these two structures
is not perfectly obvious, although an appeal to the general consensus of opinion
(Neisser, 1967; Sperling, 1967; Haber, 1969; Turvey, 1973) informs us that the
icon is a necessary precursor to active memory and that it interfaces visual
stimulation with internal models. Thus on this view the iconic representation
is a transient state that, in the hands of long-term knowledge structures, is
moulded into a variety of representations in active memory. On the foregoing
account, we may identify the schematic visual code and the name code as examples
of active representations so produced. Figure 4 captures these ideas.

Temporal Characteristics of the Iconic Interface

How might we examine the fine temporal grain of events surrounding the
iconic interface? One tool that has proved reasonably successful in this regard
is visual masking, the impairment in the perception of one of a pair of stimuli
when the two are presented in close temporal succession. There are a number of
renderings of this phenomenon, some of which are particularly significant to the
discussion that follows. First, there is the influence of the second member of
the pair on the first, which is referred to as "backward masking," and the
influence of the first on the second, which is termed "forward masking." Second,
the two stimuli may be presented binocularly, i.e., both stimuli are presented
to both eyes, or monoptically, i.e., both stimuli are presented to one eye; or
they may be presented dichoptically, in which case one member of the stimulus
pair is presented to one eye and the other member is presented to the other eye.
Masking that occurs under conditions of binocular or monoptic viewing may
originate in either peripheral or central visual mechanisms, but masking that
occurs in dichoptic viewing is more likely due to central effects. Third, the
two stimuli may or may not overlap spatially. We reserve the special term
"metacontrast" for the case in which backward masking occurs with nonoverlapping
but spatially adjacent stimuli.

Essentially, there are two major interpretations of masking effects. If
we refer to the stimulus to be identified as the target and the stimulus
impeding perception as the mask, then one interpretation, the integration
hypothesis (Kahneman, 1968), stresses the effect that the mask has on the target
representation. The idea is that the two stimuli that follow one another in
rapid succession are effectively simultaneous within a single "frame" of
psychological time, analogous to a double exposure of a photographic plate. The
result of this summation favors the higher-energy stimulus; thus, if the mask is
of greater energy, then the target will be reduced in clarity and its
identification impeded. Closely related to this interpretation is the notion
that, where a greater-energy mask follows a target, the neural response to the
target is occluded. This view is often dubbed "overtake" and it may be thought
of as an integration emphasizing a nonlinear summation of responses rather than
a linear summation of stimuli (Kahneman, 1968).

A contrasting interpretation of masking is presented in the form of the
interruption hypothesis (Kahneman, 1968): if a mask follows a target stimulus
after some delay, processing is assumed to have occurred during that delay but
is terminated, or interfered with, by the mask. In the context of the iconic
representation, we can view the interruption hypothesis as saying that an
aftercoming stimulus does not affect the accuracy of the iconic description of
a prior stimulus but rather interferes with its translation into active memory
codes. In this same context it is evident that the stimulus version of the
integration hypothesis indicates that the target and mask are dealt with as a
composite, resulting in an iconic representation in which the target is
unintelligible. The response version of the same hypothesis suggests that the
iconic description of the target is never formed.

When looked at in terms of the icon, the two interpretations of masking do
not appear as competing explanations; rather they seem to be interpretations of
masking originating at different stages in the flow of visual information
(Turvey, 1973).

We can provide some measure of proof for this point of view through an
experiment of the following kind. A target stimulus, say a letter, is presented
briefly (<10 msec) to one eye followed shortly afterward (say, 0 to 50 msec) by
a similarly brief exposure of a masking stimulus to the other eye. A contoured
mask with features in common with the target will seriously impair the perception
of the target in this dichoptic situation but, by way of contrast, a noncontoured
mask that bears no formal relation to the target, such as a homogeneous flash or
a fine-grain random-dot pattern, will not, even if, within limits, its energy is
made considerably greater than that of the target (Schiller, 1965; Turvey, 1973).
Yet if a noncontoured mask is presented to the same eye as the target, masking
can be demonstrated. On these observations it may be argued that with respect
to a contoured target, a contoured mask can exert a central influence but a
noncontoured mask cannot. The influence of the latter is limited primarily to
the peripheral processing of the target.

The gist of the whole matter is given in an experimental arrangement in
which a contoured target and an aftercoming contoured mask are presented to
separate eyes with a noncontoured higher-energy mask following shortly thereafter
on the same eye as the contoured mask. The perceptual outcome of this apparently
complex configuration of events is singularly straightforward: the target can be
seen and identified against the unimpeding background of the noncontoured mask
(Turvey, 1973). Although straightforward, the outcome is curious. It implies
that the second arriving mask occluded the first arriving mask in the
"peripheral" sequence of neural transformations from retina to cortex. That is
to say, the first mask that could impede target perception centrally never in
fact reached central processing mechanisms and the perceptibility of the target
was thus unhindered. Thus, one might venture to propose that different kinds of
masking obeying different principles originate at different stages in the flow
of visual information in the nervous system.

Consider the monoptic masking of letter targets by a masker of equal or
greater energy that is relatively ineffectual dichoptically. It has been
demonstrated that in this situation the following principle relates target
duration to the minimal interstimulus (target offset to mask onset) interval
permitting evasion of the masking action: target duration x minimal
interstimulus interval = a constant (Turvey, 1973; Novik, 1974). We owe the
first demonstration of this relationship to Kinsbourne and Warrington (1962a).

When examined more closely, target energy rather than target duration
emerges as the true entry into the relation (Turvey, 1973). Furthermore, it is
of some importance that the relation holds for forward as well as backward
masking, although the constant for the forward relation is higher than that for
the backward (Kinsbourne and Warrington, 1962b; Turvey, 1973); in brief, forward
masking in the domain of the multiplicative rule extends over a greater range
than backward masking.

My own treatment of the multiplicative function is that it characterizes
peripheral processing. If we assume that there are a large number of
independent and parallel networks detecting features and/or spatial frequencies
(see below), then the relation tells us, or so I argue, that the rate of
detection is a direct function of target energy.

We should inquire as to the relation between target duration and the minimal
interstimulus interval when the two stimuli are presented dichoptically and the
masker is a contoured, effective dichoptic-masking agent. In this situation,
the relation proves to be additive: target duration + minimal interstimulus
interval = a constant (Turvey, 1973; Novik, 1974) and here target duration is the
proper entry. Indeed, this dichoptic or central principle tells us that the
total time elapsing between stimulus onsets is what is significant. While the
energy relation between target and mask is of major importance to peripheral
processes, it is of limited importance to central processes.
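
The contrast between the two rules can be made concrete with a small numerical
sketch. The constants and the millisecond values below are hypothetical, chosen
only to make the arithmetic visible; they are not values reported in the studies
cited. The sketch also uses target duration as the entry in both rules, which
for the multiplicative rule assumes a fixed target intensity so that duration
stands in for energy.

```python
# Illustrative sketch of the two masking rules described in the text.
# Both constants are hypothetical; they are not taken from the cited studies.

K_PERIPHERAL = 200.0  # assumed constant of the multiplicative rule (msec x msec)
K_CENTRAL = 60.0      # assumed constant of the additive rule (msec)


def min_isi_multiplicative(target_duration_ms: float, k: float = K_PERIPHERAL) -> float:
    """Minimal ISI under: target duration x minimal ISI = constant."""
    if target_duration_ms <= 0:
        raise ValueError("target duration must be positive")
    return k / target_duration_ms


def min_isi_additive(target_duration_ms: float, k: float = K_CENTRAL) -> float:
    """Minimal ISI under: target duration + minimal ISI = constant."""
    # An interval cannot be negative: once the target duration alone reaches
    # the constant, no further interval is required.
    return max(k - target_duration_ms, 0.0)


if __name__ == "__main__":
    for duration in (2.0, 4.0, 8.0):
        print(duration, min_isi_multiplicative(duration), min_isi_additive(duration))
```

Under the multiplicative rule, doubling the target duration halves the interval
needed to escape masking; under the additive rule, each added millisecond of
target duration removes one millisecond from the required interval, so that the
onset-to-onset time stays fixed.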

Perhaps the brunt of the peripheral/central difference is carried by the
contrast of forward and backward masking. We recall that peripherally forward
masking was greater than backward even though both obeyed the same principle.
Contrapuntally, central forward masking is slight in comparison to central
backward masking (Smith and Schiller, 1966; Turvey, 1973). We see that central
masking is primarily backward and this may be an important comment on the nature
of central information processing (cf. Kolers, 1968).
When two successive stimuli compete for the services of the same central
processes, it is the later arriving one that is more completely identified. On
the other hand, when two stimuli compete for the same peripheral networks, order
of arrival is less important than energy. Peripherally the stimulus of greater
energy, whether it leads or lags, will be the one whose properties are likely to
be registered.

We ought to ask how the two rules described above relate to each other. On
the hypothesis that the multiplicative rule relates to primarily peripheral
processes and the additive to central, our question becomes that of how
peripheral and central processes combine. At first blush we might think that
this relationship is successive and additive; that is to say, peripheral
processing is completed first, then central processes occur, and total
processing time is given by adding the two stages. It proves to be the case,
however, that a successive and additive interpretation will not do (Turvey,
1973: Exp. 12). A more reasonable interpretation is that the output from
peripheral networks is parallel and asynchronous, with different features
detected at different rates (Kolers, 1967; Weisstein, 1971), and that central
processes, obviously dependent on peripheral input, operate simultaneously with
peripheral processes. The relation is said to be concurrent and contingent
(cf. Turvey, 1973).

Identifying Feature-Detectors in Human Information Processing

We noted earlier that "information processing" as a methodology for examin- 
ing transformations of neural states may not be able to give us a measure of 
vision's structural grammars nor a sufficiently detailed account of how represen- 
tations commute. We may now turn to a particularly significant contribution of 
the approach: its ability to suggest or demonstrate feature-sensitive systems 
in human vision.

It is well-known that there are cells in the visual cortex of the cat and
monkey selectively sensitive to the orientation of lines. How might we reveal
the presence of such units in human vision? A demonstration is important because
our constructive theories of human pattern and object recognition presuppose the
existence of these units.

A phenomenon well-suited to this purpose is the "aftereffect." The logic 
behind visual aftereffects as a technique is that prolonged viewing of a stimu- 
lus consisting of a particular visual feature should selectively depress the 
mechanism detecting that feature. We can take advantage of a fairly general 
rule: if any form of stimulation is continued for a long time and then stopped, 
one will tend to experience the reverse condition. 

Orientation-specific systems in human vision are hinted at by the tilt
aftereffect (Gibson and Radner, 1937). Suppose that one was exposed for a
period of time to a high-contrast grating tilted slightly to the right of
vertical. A subsequently exposed vertical grating (or vertical line) would then
appear tilted slightly to the left. This is termed the "direct effect." The
fact that exposure to a vertical grating tilted clockwise will cause a
horizontal grating or line to appear counterclockwise from the horizontal is
termed the "indirect effect" (see Coltheart, 1971). An eminent example of this
aftereffect is provided by Campbell and Maffie (1971). The evidence is that the
human visual system exhibits a narrow orientational tuning of the type reported
for units in the cat and monkey (Hubel and Wiesel, 1968).

A particularly intriguing aspect of the tilt aftereffect is that it can be
color-specific. If one is first exposed to red stripes tilted clockwise off
vertical and green stripes tilted off vertical by the same amount but
counterclockwise and then one examines a vertical test stripe, its apparent
orientation will depend on the color of the light in which it is projected.
Projected in red light, it will appear tilted counterclockwise; projected in
green light, it will appear tilted clockwise (Held and Shattuck, 1971). This
orientation aftereffect specific to color should be contrasted with the now
famous McCullough phenomenon, which is a color aftereffect contingent on
orientation (McCullough, 1965). In the latter, viewing alternately a vertical
grating on a blue background and a horizontal grating on an orange background
induces an orange afterimage when vertical lines on a white ground are inspected
and a blue afterimage when horizontal lines on a white ground are inspected.

A variety of contingent aftereffects has now been demonstrated. A
representative but not exhaustive list would look like this: motion-contingent
color aftereffects (Stromeyer and Mansfield, 1970); color-contingent motion
aftereffects (Favreau, Emerson, and Corballis, 1972); texture-contingent visual
motion aftereffects (Mayhew and Anstis, 1972; Walker, 1972) and
curvature-contingent color aftereffects (Riggs, 1974).

Let us consider one further example of how the aftereffect phenomenon demon- 
strates that different characteristics of visual stimulation can be processed 
selectively. If one views a contracting simple arithmetic ("Archimedes") spiral 
for some period and then directs the eyes to another object, one experiences an 
equally great but opposite effect, i.e., the object will appear to expand. But 
suppose that the adapting spiral subtended only 4° of visual angle and that
following prolonged viewing one looked at a large and regular 20 x 20 matrix of
squares. A region of the matrix subtending about 4° will appear to enlarge and
even to approach. The lines, however, will not look curved nor will there be
any interruption between the growing part of the matrix and the remainder of the
matrix. As Kolers (1966, 196?) remarks, part of the figure seems to change in
size and position without looking discontinuous with the remainder.

Although aftereffects are curious and engaging, their usefulness as a 
source of information about feature analyzers in human vision has not been
endorsed uncritically by all students of perception. Weisstein (1969) sees the
principal weakness of the technique as that of examining what is left after 
adaptation. It is reasonable to conjecture, and indeed the contingent after- 
effects bear this out, that the aftereffect manifests a pooling of many differ- 
ent types of analyzers. Moreover, a number of different populations of analyzers 
could give the same phenomenal result. Other students have raised similar doubts 
about the integrity and fruitfulness of aftereffect data for the exploration of 
feature detectors in human vision (Harris and Gibson, 1968; Murch, 1972).

An alternative strategy is one that is referred to as cross adaptation.
The essential characteristic of cross adaptation is that one examines the loss
of sensitivity to one pattern given a preceding exposure to another. The measure
of change in sensitivity contrasts cross adaptation with the "aftereffect," which
is concerned with the degree to which perception is reversed. When a grating is
viewed for some period of time, the thresholds for the same and similar gratings
are raised, but thresholds for gratings differing in orientation and size are
virtually unchanged (Blakemore and Campbell, 1969). This may be taken as
evidence that different populations of neurons respond differentially to
features of a stimulus.

There is an especially provocative application of this procedure that allows
us to draw together several strands of this discussion. Weisstein and her
coworkers (Weisstein, 1970; Weisstein, Montalvo, and Ozog, 1972) were motivated
by an aspect of the theory of object recognition that is prominent in scene
analysis programs but neglected in psychological experiments. In scene analysis,
constructing a representation involves abstracting certain entities, identifying
their attributes, and specifying the relationships among them. Virtually all
aftereffects and all cross-adaptation experiments are directed toward isolating
and defining entities. But of interest to Weisstein (1970) was the possibility
of demonstrating relations among entities in the scene domain. She inquired
whether one could selectively adapt the neural structures responsible for the
relation "in back of." Her experiment was deceptively simple. Subjects
inspected a vertical grating that was partially blocked from view by a
perspective drawing of a cube. A subsequent vertical test grating was presented
within the portion of the visual field where the prior grating was visible and
also where it was not (i.e., in the region covered by the cube). A reduction in
apparent contrast (adaptation) was found for both positions of the test grating.
This means that a population of size-selective and orientation-selective systems
that was adapted out by the original grating was also adapted out (although not
to the same degree) by the cube. Yet it could be demonstrated that a cube by
itself (i.e., not covering a grating) does not induce a significant adaptation
effect nor does a hexagon outline drawing of a cube that partially occludes the
same area of grating (Weisstein et al., 1972). Apparently, what is important to
the effect is the impression of depth given by the perspective drawing of the
cube, and one might interpret this result to mean that the relation "in back of"
is specified by the firing of those cells that would have fired if the grating
had been visible in the region covered by the cube. In Weisstein's view, this
effect implies the involvement of neural mechanisms that separate a scene into
its components, analogous perhaps to the operations we witnessed in the computer
programs described earlier.

Information Processing as a Coalitional Skill

We can conclude this brief and selective account of constructivism with an
emphasis on the coalitional/heterarchical conception of how perceptual procedures
are organized. What follows is a potpourri of curious experimental observations
that implicate (but do not necessarily dictate) the form of organization
encountered in our earlier discussion of seeing machines.

Oriented-line detectors play a significant role in theories of object
recognition. Generally they are assigned to an early stage in a hierarchically
organized processing scheme. However, their actual status within the scheme and
the scheme's structure are less than obvious, as witnessed by an experiment of
Weisstein and Harris (1974). They demonstrated facilitation of "feature" (line)
detection by object context. This work is matched by a facilitation of "object"
detection by scene context elegantly revealed in the work of Biederman and his
colleagues (Biederman, 1972; Biederman, Glass, and Stacy, 1973; Biederman,
Rabinowitz, Glass, and Stacy, 1974). An object is more accurately identified
when part of a briefly exposed real-world scene than when it is part of a jumbled
version of that scene, exposed equally briefly.

These observations are puzzling on the assumption that the detection of
fragments of a pattern predates the determination of the global structure and
identity of a pattern. Thus, paradoxically, the identity of the whole depends
on the identity of its fragments, but the perceptibility of a fragment is
determined by the whole in which it is embedded as an integral part.
Significantly, where a fragment is immaterial to the global structure its
presence is more likely to be obscured than enhanced (see Rock, Halper, and
Clayton, 1972).

This paradox is also apparent in the perception of linguistic material. It
can be demonstrated that a letter is perceived more readily in the context of a
word than in the context of a meaningless string of letters (e.g., Reicher,
1969; Wheeler, 1970; Johnston and McClelland, 1974). Though some have suggested
that results of this kind can be interpreted solely in terms of the superior
orthographic/phonologic regularity in the word (cf. Baron and Thurstone, 1973),
others have sought to demonstrate the significance of meaning. Controlling for
orthographic/phonologic regularity (as best one can), Henderson (1974) has shown
that meaningful letter strings, such as VD, LSD, YMCA, are compared faster in a
binary classification task than meaningless strings, such as BV, LSF, YPMC.

There is some motivation for interpreting results of the latter kind in
terms of a direct accessing of semantic knowledge. For example, it has been
shown that the time to reject a meaningful consonant triplet as a word is longer
than the rejection latency for a meaningless consonant triplet (Novik, 1974).
This observation contradicts the idea that an analysis of orthographic/phonologic
regularity precedes entry into the lexicon and suggests to the contrary that
lexical access is temporally contiguous with structural analysis. One conception
that may be useful here is that of Henderson (1974): feature analysis has
parallel access to various memory structures or domains (in the vernacular of
our earlier discussion), namely grapheme knowledge, orthographic rules, and a
content-addressable semantic base. Though consulted in parallel, the domains
interact coalitionally through rules that map from one to the other. Presumably
perceptual decisions are reached through this mutual cooperation among separate
domains.

It is worth remarking further on the idea of a direct mapping between script 
and meaning. Several scholars have found this notion necessary to their accounts 
of skilled reading (e.g., Bower, 1970; Kolers, 1970), and there are a number of
provocative clinical observations that speak in its favor. To take but one
example, a paralexic error (though not a common one) is to read a word as a
semantic relative; thus hen is read as "egg" (Marshall and Newcombe, 1971). The
reader in this instance cannot identify the word, nor can he give a phonetic 
rendering of it, but he can relate to its semantic structure. There are paral- 
lels to this clinical observation to be found in the visual masking literature, 
namely, experiments suggesting that an observer may have some knowledge of the 
meaning of a masked word even though he may be unable to report the actual 
identity of the word (Wickens, 1972; Marcel, 1974). 

There is a second batch of curious results, much related to those just de- 
scribed. Suppose that one is asked to scan a list of items in search of a 
specified target item. As a first approximation we can say that the perceiver 
looks for the set of features (or an appropriate subset) that defines the target 
item. In this case, nontarget items, or foils that share many of the target's fea- 
tures, will be a greater hindrance to the search than foils that are less closely 
related. Experimental evidence tends to bear this out (Neisser, 1967).
Unfortunately, this first approximation is challenged by visual scan experiments
in which the target is drawn from a conceptual category different from the
foils: the target, say, is a letter and the foils are digits, or vice versa.
Here the evidence is that the time to find any letter (digit) in a list of
digits (letters) is less than the time it takes to find a particular digit
(letter) in a list of digits (letters) (Brand, 1971; Ingling, 1972). The
implication is that category discrimination can precede character
identification: one can know that a character is a letter or digit before one
knows what letter or digit it is. Our puzzle is: What are the features that
define the category of letters on the one hand and digits on the other? This
puzzle is compounded by the following experiment, which plays on the ambiguity
of the character "0" (it may be interpreted as either the digit zero or the
letter "oh"). When "0" is embedded in a list of digits, it can be found more
rapidly if the observer is told that she is looking for a letter than if she is
told that she is looking for a digit. Conversely, when "0" is a member of a list
of letters, latency of search is considerably shorter if one is looking for the
digit zero than if one is looking for the letter "oh" (Jonides and Gleitman,
1972). This result is not restricted to the digit/letter distinction, for
research at the University of Belgrade reveals the same pattern of findings when
the target is an ambiguous letter, that is, a letter having one phonetic
interpretation in the Cyrillic alphabet and another in the Roman alphabet.* (The
popular use of two alphabets is an interesting feature of the Serbo-Croatian
language.)

Though we cannot provide an explanation for the phenomena just described, we
can appreciate their implications for the modeling of the human as an information
processor. Whatever procedures are presumed to be involved in the processing of
information, it can be hypothesized that their manner of interrelating is not
obligatory. The nature of the task constrains the structure of the coalition
with different tasks requiring different coordinations of procedures. In this
sense, information processing is a coalitional skill.

PERCEPTION AS PRIMARY AND DIRECT: THE GIBSONIAN ALTERNATIVE

Introduction 

Since constructivism starts from the assumption that stimuli are
informationally inadequate, it is obvious why the primary concern of perceptual
theory is taken to be the investigation of the how of perception. Constructivism
encourages an inquiry into memory structures and cognitive operations that
mediate cues, punctate sensations, features, or whatever, and the perceptual
experience. On this view internal models enable adaptation to a world poorly
signaled by the flux of energy. Thus it is the internal models, their
acquisition, and their usage, that we seek to understand.

In polar opposition to this strategy stands Gibson's suggestion that we ask
not what is inside the head, as the constructivists would have it, but rather
what the head is inside of (Mace, in press). For Gibson the question of the what
of perception has been given short shrift, and in his view we are burdened with
excessive theoretical baggage relating to the how of perception for the very
reason that what there is to be perceived has not been seriously examined (Shaw
and McIntyre, 1974).



*Georgije Lukatela, 1975: personal communication.

Gibson begins with the question: What do terrestrial environments look
like? This is a departure point curiously ignored by those who would build
general theories of visual perception. Indeed, it can be argued that the
constructivist approach to vision has not broken kinship with the school of
thought that gave rise to Molyneux's premise in the seventeenth century
(Pastore, 1971). By taking empty Euclidean space as the frame of reference, the
intellectual predecessors of modern constructivism were forced into the position
of arguing that distance, for example, could not be apprehended through vision.
Distance could only be arrived at (constructed, inferred) through the
supplementary information provided by past experience and represented in the
form of kinesthetic and tactile images. A moment's examination of the classical
exposition of space perception will stand us in good stead.

In the classical view the third dimension was construed as a straight line
extending outward from the eye. But since physical space was interpreted as an
empty Euclidean space, nothing existed between the eye and an object fixated.
One might refer to the theory derived from this conception as the "air theory"
of space perception (Gibson, 1950), for it implies an observer looking at
unsupported objects hanging in midair. (We can readily admit to the
unnaturalness of this characterization, but our criticism must be held in
abeyance for the time being; a great deal of modern theory has been erected on
this way of describing perceptual situations in the abstract, and it is
incumbent upon us to appreciate fully the point of view.)

On the "air theory" formulation, an object fixated by an observer projects 
a two-dimensional form on the retina that relates to the size and outline of the 
exposed face(s) of the object. What needs to be explained is how the object's 
distance is perceived. To grasp the nature of this problem, consider Figure 5, 
which portrays in traditional fashion a number of objects at varying distances 
from the observer. The size of each object's projection onto the retinal
surface is a function of the visual angle formed by the light rays from the
extremities of the object. We have chosen our objects and their slants such that
the visual angle projected by each is the same. Clearly, size on the retina does
not unequivocally specify distance and we are led to conclude that retinal
information, and thus vision, is insufficient for object-distance perception. It
necessarily follows that for perceiving the distance of objects other information
must be supplied, supposedly from the observer's memory banks. For instance,
the transactionalists (e.g., Ittleson, 1960) looked to one's familiarity with ob- 
jects (and thus with their actual dimensions) as the relevant memory material. 
But this left unsettled the problem of perceiving the distance of unfamiliar ob- 
jects. For an alternative one could look to the information available in the 
converging of the two eyes (cf. Gregory, 1969). Supposedly the angle set between 
the eyes when converging on an object is a cue to its distance. But there is a 
good reason to doubt the efficacy of convergence (Ogle, 1962) and in any event
such a cue is not at the disposal of many animals, namely, those with nonconverg- 
ing eyes who do perceive object distance. Furthermore, we can readily imagine 
the problem pictured in Figure 5 as being that of the limited information to a 
single eye and we can, after all, perceive distance monocularly. 
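
The ambiguity at the heart of the "air theory" can be restated as a short
calculation. The sketch below is an illustration only: it assumes flat objects
facing the eye squarely (slant is ignored for simplicity), and the function
names and the particular sizes and distances are hypothetical.

```python
import math


def visual_angle_deg(size: float, distance: float) -> float:
    """Visual angle (degrees) subtended by a frontal object of a given size.

    `size` and `distance` are in the same arbitrary unit. Slant is ignored,
    which is a simplifying assumption of this sketch.
    """
    if size <= 0 or distance <= 0:
        raise ValueError("size and distance must be positive")
    return math.degrees(2.0 * math.atan(size / (2.0 * distance)))


def size_for_angle(angle_deg: float, distance: float) -> float:
    """Size a frontal object must have to subtend `angle_deg` at `distance`."""
    return 2.0 * distance * math.tan(math.radians(angle_deg) / 2.0)


if __name__ == "__main__":
    # Three hypothetical objects: small and near, medium, large and far.
    for size, distance in ((1.0, 10.0), (2.0, 20.0), (5.0, 50.0)):
        print(size, distance, visual_angle_deg(size, distance))
```

All three objects subtend the same angle, and `size_for_angle` shows that for
any distance whatever there is some object size that would produce a given
angle. The angle alone therefore fixes only the ratio of size to distance, which
is the sense in which retinal size fails to specify distance on this account.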

In his 1950 text Gibson responded to the classical treatment of space per- 
ception in a way that was both elegant and simple. To begin with he translated 
the abstract question, "How is space perceived?" into the biologically and
ecologically meaningful question of, "How is the layout of surfaces detected?"
More especially, Gibson asked, "How, from a point of observation, do we see
continuous distance in all directions?" rather than the commonplace experimental
question of "How do we judge the distance of these two objects?" The importance
of an impression of continuous distance is that it underlies our capability to
perceive the distances of any number of objects in a field of view.

In the next step, Gibson replaced the traditional conception of an isolated
eye viewing mathematical points in empty space with that of an eye attached to
a body in contact with a ground surface viewing points attached to the surface.
In this account of the problem it is evident that by the laws of linear
perspective the retinal image is structured in a way that corresponds
unambiguously to the distribution of points. (Figure 6 contrasts the classical
and Gibsonian accounts.) Moreover, if we now replace the points with objects,
then it is similarly evident by the laws of linear perspective that the
projection of these objects-on-a-surface will be such that near objects will be
imaged large and high up on the retina, while far objects will be imaged smaller
and lower down.
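
The perspective relations just described can be sketched with a pinhole model.
The assumptions here are loud ones: a pinhole eye with a horizontal line of
sight, a flat ground plane, an inverted image, and a hypothetical focal length
and eye height. Elevation is measured on the inverted retinal image, upward from
the point where the horizon projects.

```python
def retinal_elevation(eye_height: float, distance: float, focal_length: float = 1.0) -> float:
    """Height on the (inverted) retinal image of a ground point `distance` away.

    Measured upward from the projection of the horizon. The pinhole geometry
    and the default focal length are assumptions of this sketch.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return focal_length * eye_height / distance


def retinal_size(object_height: float, distance: float, focal_length: float = 1.0) -> float:
    """Image size of an upright object standing on the ground at `distance`."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return focal_length * object_height / distance


def distance_from_elevation(elevation: float, eye_height: float, focal_length: float = 1.0) -> float:
    """Invert `retinal_elevation`: each image height picks out one ground distance."""
    if elevation <= 0:
        raise ValueError("elevation must be positive for points on the ground")
    return focal_length * eye_height / elevation


if __name__ == "__main__":
    eye_height = 1.6  # hypothetical, in metres
    for d in (2.0, 4.0, 8.0):
        print(d, retinal_elevation(eye_height, d), retinal_size(1.0, d))
```

In this sketch nearer ground points are imaged higher and nearer objects are
imaged larger, and because `retinal_elevation` is strictly decreasing in
distance it can be inverted: once points are attached to a ground surface, image
position corresponds to one and only one distance.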

Although what has been said so far does not do justice to Gibson's current
thinking (particularly the references to retinal image), it is clear that on
this reinterpretation of the space perception problem the information to an eye
is conceivably richer than traditional theory would have us believe. Indeed,
one might venture to say that the light to an eye could be structured in a way
that corresponds to the layout of points and objects on a ground surface.

Let us return to the classical view in order to illustrate its pervasive
influence on vision theory. Earlier we examined the motivation for assuming that
perception goes beyond what is given, that is, the view of perception as
indirect. In the classical theory of space perception we have the original
impetus: the third dimension is lost in the two-dimensional retina. The
immediate theoretical consequence of this "loss" is obvious and we have already
deliberated on it. But it is worth remarking once more: the retinal image is
held to be a flat patchwork of colors, more precisely of colored forms, to which
one can add a third dimension by using available cues, some of which are given
directly and some of which are themselves constructions (e.g., superposition).

Of the many conceptual progeny sired by the classical story the following
two are among the most significant. First is the notion, now inviolate, that
the two-dimensional retinal image is the proper starting point for any theory of
seeing. It is this notion that legitimizes, among other things, the enterprise
of building theories of visual perception on the shoulders of experiments in
picture perception. Moreover, since perceiving begins with a flat patchwork of
colored forms, it gives to the theories of form perception and color-patch
perception a special status. They are, as it were, propaedeutic to the theory of
visual perception.

Second is the idea that one should examine the retinal image for copies of
environmental aspects, and where copies are not found, those aspects are said to
be inferred, guessed at, or created. For instance, the third dimension does not
have a copy in the retinal image and so it must be constructed; similarly the
shape of an object (its structure in 3-space) is not replicated in the image and
so it must be constructed or inferred from the two-dimensional outline of the
object, which is replicated in the image; and by the same token the arrangement
of flashes for the phi phenomenon does not produce a retinal copy of physical
movement defined as object displacement in space, so the impression of movement
in this situation must be a mental creation.

Gibson's (1950, 1966) contention is that this way of conceptualizing visual
perception, that is, along the lines suggested in the classical theory of space
perception, is blatantly in error. Among his reasons for this contention, the
following bear directly on the classical position. For Gibson there is no such
problem as the problem of depth or space perception. The very concept of "space"
as empty Euclidean space is irrelevant to discourse on perception, and its use
in psychology, he argues, is confused and confusing. Animals, he will tell us
repeatedly, do not perceive space. We have misconstrued the problem: what
animals perceive is the layout of surfaces. And the significance of this
restatement of the problem is that in the light there is information for the
perception of surface layout; thus surface layout may not have to be created or
guessed at, it can be sensed, in the meaning of detected.

If Gibson's point of view is valid, then many concepts that evolved with the
attempts to solve the space perception problem, and that were perpetuated by its
classical solution, may have to be discarded. Let us therefore proceed to the
question of what terrestrial environments look like.

Ecological Optics

The tenor of Gibson's denunciation of classical theory is that evolution did
not devise visual systems to operate on a vacuous Euclidean space but to detect
and interact with the properties of cluttered environments. The environments of
animals and man induce inhomogeneities in reflected light, and it is Gibson's
principal, guiding intuition that environmental events structure light in ways
that specify their properties. The proof of this intuition rests with
"ecological optics" (Gibson, 1961), an enterprise that seeks to determine what
is contained in the light: if it can be demonstrated that the light is
structured by environmental events in a specific fashion, then it is said that
the light contains information for these events. In Gibson's view we should look
for tighter correspondences between environmental facts and the light as
structured by those facts. But all of the above is by way of a summary, and to
be appreciated it requires that we reexamine the conception of light with
reference to visual perception.

The terrestrial environment is manufactured from solids, liquids, and gases
of various chemical composition. The interfaces between these phases of matter
are what we normally refer to as surfaces. Surfaces can be said to have
structure (that is to say, they are textured), where the texture (or grain)
consists of elements of one kind or another duplicated over the entire surface.
The "signature" of a surface is given by the cyclicity or periodicity of these
textural elements.

But significantly the terrestrial environment as a whole possesses structure
at all levels of size: at one extreme it is structured by mountains; at the
other, by pebbles and grass (Gibson, 1966). And equally significantly the
structure at one level of size is nested in the structure at the next, higher
level of size. Each component of the environment can be said to consist of
smaller components; facets are nested within facets, and in the most general of
senses, forms are nested within forms. Thus the texture of the individual brick
is nested within the texture of the wall as given in the arrangement of bricks.
We see that it is considerably more apt to treat the structure of the
environment as hierarchic than as mosaic.

The natural environment receives its illumination from the sun, although
our living and working environs are mostly illuminated by man-made sources of
radiant light. Not all surfaces are directly illuminated by radiant light;
some, occasionally most, are illuminated by the light reflected from other
surfaces. Natural surfaces tend to be opaque rather than transparent, which
means they reflect light rather than transmit it. More precisely, opaque
surfaces reflect some portion of incident light and absorb the remainder as a
function of their chemical composition. A significant feature of natural
surfaces is that they do not reflect light like a mirror, rather they scatter
it. This scatter-reflectance is of two kinds: selective for wavelength and
unselective for wavelength. In combination, gradations in these two kinds of
reflectance, and complex differences in reflectivity (in scattering) due to a
textural difference, mean that each different surface modifies light
idiosyncratically.

Surfaces, of course, articulate at various angles to each other and when 
illuminated reflect light among themselves in multiple fashion. The consequence 
of this is a "flux of interlocking reflected rays in all directions at all 
points" (Gibson, 1966:12), and, therefore, the light that impinges on an observ- 
er in the transparent medium of an illuminated environment is primarily indirect, 
reflected light. 

We need a term to distinguish this light from the radiant light with which
we have become most acquainted through physics. Gibson (1961, 1966) suggests
the term "ambient" for radiant light that has been modulated by an arrangement
of surfaces, that is, by an environment.

There are several significant distinctions to be drawn between radiant and
ambient light. Ambient light is in reference to an environment and radiant
light is not. Also the two may be distinguished in that ambient light converges
to a point of observation, while radiant light diverges from a source of energy.
In an illuminated room, wherever one stands there is a sheaf of rays converging
at that point. This is a consequence of the multiple reflection (reverberation)
of radiant light from the walls, ceiling, and floor. An illuminated medium
therefore is said to be packed with convergence points, or points of observation
(Gibson, 1961, 1966).

To the bundle of light rays converging to a point of observation we give a
special name: the optic array (Gibson, 1961, 1966). Now because surfaces differ
in the amount of incident light falling on them, and in inclination,
reflectance, and texture, the optic array at a point of observation will consist
of different intensities (and different spectral values) in different
directions. In short, the optic array has structure and we may in consequence of
the conclusion identify a further distinction between radiant and ambient light:
the former has no structure and is simply energy, the latter is structured and
is potentially information.

What is implied by the radiant/ambient distinction? Assuming that one does
not find the distinction quixotic, the foremost implication is that light as
stimulation for the ocular equipment of man, animal, and insect has not been
properly described by physics. Radiant light has been lucidly portrayed, but
not ambient light; and if the preceding is to be respected, it is ambient light
that permits perception of the environment.

However, if, following tradition, we remain true to the persuasion that
radiant light is the valid and only starting point for the theory of visual
perception, then we must conform to the task of explaining how the richness and
variety of the visual experience is derived from intensity and wavelength, the
variables of radiant light to which brightness and hue correspond as the basic
variables of perception. With radiant light as the starting point, we are
virtually condemned to a theory of indirect perception.

What befalls us when we take ambient light as the point of departure? To
begin with, we lose the security of physics, for the stimulus variables of an
optic array are not readily couched in the fundamental physical measures. We
gain, however, a new perspective on the problem of visual perception, for with
ambient light the variables of stimulation are potentially as rich as the
variables of experience. Let us dwell, therefore, a little longer on the concept
of the optic array.

An especially useful description of the optic array is that it consists of a
finite (closed) set of visual solid angles with a common apex at the point of
observation and with a transition of intensity separating each solid angle from
its topological neighbors (Gibson, 1972). A component solid angle of an array
corresponds to a component of the environment, and just as the environment is
hierarchically structured, so it is with the optic array: visual solid angles
are nested within visual solid angles, and more generally, forms are nested
within forms. A few of the variables of the optic array can now be noted:
abruptness and amount of intensity transition between visual solid angles;
density of intensity transitions within a visual solid angle and change of
density across visual solid angles; rates of change, or gradients, in the
density of intensity transitions and, with movement, rates of rates of change.
This truncated inventory should suffice to suggest that the variables of
ecological stimulation are both much richer and of a higher order than those
normally adduced for the energy impinging upon an organism.

In summary, explanations of visual perception have traditionally been
proposed with reference to the limited variables of radiant light. It would
seem, however, that ambient light, that is, radiant light as structured by an
environment, is the proper reference. The variables of ambient light are complex
and probably limitless. Moreover, there should be a strict correspondence
between an ambient optic array and the environment responsible for it. This
suggests the following hypothesis: for any isolable property of the environment
there should be a corresponding isolable property of the stationary or flowing
optic array, however complex. It is this hypothesis, relating to the "what" of
perception, that constitutes the concern of ecological optics. If true, this
hypothesis has profound implications for perceptual theory. For if the
structured light to the visual system can specify the world and if it is only
rarely equivocal (contrary to hundreds of years of studied opinion), then we can
imagine that visual systems evolved to be sensitive to its structure, i.e., to
the higher-order variables of the optic array. Such being the case, visual
perception need not be constructive (it need not be a guessing game); it could
be direct. The latter is Gibson's working hypothesis on the "how" of perception
and it may be stated more formally: for every visual perceptual experience,
there is a corresponding property of the stationary or flowing optic array,
however complex (cf. Gibson, 1959, 1966).

The Concept of Invariance

What then is the nature of the correspondence between the optic array and 
environmental properties? 

Recall that tradition opted for replication as the form of the correspondence
between environment and retinal image. One consequence of this has been dealt
with; namely, that because the third dimension is not registered in the retinal
image, it therefore must be constructed. The upshot of the Gibsonian analysis
is that the correspondence is correlative (in the sense used by analytic
projective geometry and not in the sense used by statistics) and that our search
should be for correlates of environmental properties, and not for copies.
Moreover, one should look for these in the ambient optic array rather than in
the retinal image. Where a property of the static or changing optic array
corresponds in this sense to a persistent property of the environment, it is
referred to as an invariant.

The Stationary and Flowing Optic Array

In the current scheme the natural stimulus for an ocular system is the optic
array. This contrasts with the view discussed earlier of stimuli as punctate
and momentary, because an optic array by definition is extended and persistent.
Furthermore, stimuli are traditionally conceived to be at the retina (hence the
primacy of the retinal image), whereas the optic array clearly is not. Indeed,
we can mathematically describe the optic array at a position (point of
observation) in an illuminated environment without assuming that the position is
occupied by an observer (Gibson, 1966).

A significant feature of Gibson's conception of a point of observation is
that it is stationary only as a limiting case, simply because observation
generally accompanies movement (i.e., the observer is often in motion).

A moving point of observation means a changing optic array: component
visual solid angles will be transformed, some will even cease to exist. But the
surrounding layout of surfaces responsible for the optic array at a stationary
point of observation must also be responsible for the flowing optic array at a
moving point of observation. There ought to be abstract optical relations that
persist during the course of change, and these relations ought to be specific to
the persisting or permanent properties of the environment. On the other hand,
the nonpersisting features of the optic array are due to the motion of the point
of observation, that is, to locomotion itself. Of the changes in the optic
array, we will say that they specify locomotion, and herein lies a significant
insight.

Briefly, the gist of the preceding may be conveyed in this fashion: the
invariant structure of the changing optic array is information about the
environment, that is, it is exterospecific; the variant structure, on the other
hand, is information about the observer and therefore may be referred to as
propriospecific. We proceed from here with the notion of invariant structure
(drawing a comparison between invariant information in the changing and in the
unchanging, stationary optic array), leaving to a later moment the egospecific
character of variant optical structure.

Consider a problem encountered earlier, that of perceiving continuous
distance. From a stationary point of observation, a receding textured surface
projects a gradient of optical texture density. (By "gradient" is meant nothing
more than an increase or decrease of some measured quantity along a given
dimension or axis.) For two identical (but hypothetical) visual solid angles,
the one corresponding to a region of the surface close to the point of
observation will be less densely packed with intensity transitions (produced by
the surface's textural elements scattering the light) than the one corresponding
to a region farther away. And, of course, for each identical visual solid angle
sandwiched between these two the density of transitions would increase with the
distance of the region to which the visual solid angle corresponded.
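
The gradient of optical texture density just described can be made concrete
with a short numerical sketch. It is a simplified illustration under stated
assumptions, not a model drawn from the sources cited: the surface is a flat
ground plane with uniformly spaced texture elements, the visual solid angle is
reduced to a fixed angular window in elevation, and the eye height, spacing,
and window size are arbitrary. The name `transitions_in_window` is
hypothetical.

```python
import math


def transitions_in_window(distance, eye_height=1.6, spacing=0.05,
                          window=0.01):
    """Approximate count of texture elements seen through a fixed angular
    window (in radians) aimed at the ground point `distance` meters ahead.

    Assumes a flat ground plane with elements every `spacing` meters;
    all default values are illustrative.
    """
    # Angle below the horizon at which the ground point is seen.
    declination = math.atan2(eye_height, distance)
    half = window / 2.0
    if declination - half <= 0:
        raise ValueError("window reaches the horizon; use a nearer point")
    # Ground distances at the lower and upper edges of the window.
    near_edge = eye_height / math.tan(declination + half)
    far_edge = eye_height / math.tan(declination - half)
    return (far_edge - near_edge) / spacing


for d in (2, 5, 10, 20):
    print(d, round(transitions_in_window(d), 2))
```

With these assumptions the count rises steadily with distance: the same
angular window takes in more of the surface, and hence more intensity
transitions, the farther away the region to which it corresponds.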

In the case of a moving point of observation, we can suppose that the
impression of distance is given by a dynamic transformation of texture, more
precisely, by a gradient of flow velocities: the rate of textural flow for
"close by" would be greater than that for "far away." In sum, there are
correlates of distance in both the flowing and stationary optic array. The
impression, however, is that the flowing optic array is less equivocal in its
specification of distance, and of surface variables such as slant (cf. Flock,
1964). But this is only an instance of what is a most general rule, namely, the
superiority of changing optic arrays over stationary optic arrays as information
about an environment (Gibson, 1966). We should not be surprised by this
observation. After all, if a stationary point of observation is a limiting case
of a moving point of observation, then it follows that the structure of the
optic array at a stationary point is but a special case of the structure of the
optic array at a moving point. One can contrive stationary perspectives to fool
an observer; to fool a moving observer is an incomparably more difficult task.
The notorious Ames room (a distorted room with unpatterned walls and floors) is
a case in point. Limited to a stationary perspective, the observer is fooled,
but he detects the ruse once he is allowed to move.
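
The gradient of flow velocities mentioned at the start of this paragraph can
also be sketched numerically. The example is an idealization and carries its
own assumptions: the observer moves straight ahead at constant speed over a
flat ground plane, and only points directly ahead are considered, so that the
optical flow reduces to the rate of change of a point's angle below the
horizon. The speed and eye height are arbitrary, and `flow_rate` is a
hypothetical name.

```python
def flow_rate(distance, speed=1.4, eye_height=1.6):
    """Angular velocity (radians per second) of a ground point `distance`
    meters straight ahead of an observer walking forward at `speed`.

    Obtained by differentiating the declination atan(eye_height / distance)
    with respect to time while distance shrinks at `speed`.
    """
    return speed * eye_height / (distance ** 2 + eye_height ** 2)


for d in (2, 10, 50):
    print(d, flow_rate(d))
```

Under these assumptions the rate of flow falls off roughly with the square of
distance, so texture "close by" streams far faster than texture "far away."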

It is already understood that the ambient optic array as the natural
stimulation for vision has structure: it has some degree of adjacent order and
some degree of successive order. What we now understand is that, concurrently,
it has components of change and nonchange. For Gibson the import of change is
that it reveals what is essential, namely, invariance. We can put this another
way: structural variation reveals structural nonvariation. In this conception,
movement on the part of the organism serves a singularly important geometric
function. There is structure that can be detected in a frozen array, but there
is considerably more structure to be detected in a changing array, and the
organism can effect the change. In short, the movements of the observer
transform the ambient optic array and thereby enhance the detection of
invariants.

The Visual Perceptual System 

The above reasoning motivates the conception of the mechanism of vision as
an actively exploring system. Traditionally, however, a sense system is
conceived more conservatively as a well-specified anatomical route that delivers
sensations to identifiable centers; integration and cross correlation of sense
data then follow as circumstance demands. But in an ecological perspective, the
system responsible for seeing ought to be described in ascending levels of
activity; for example, mobile eye, mobile eyes in mobile head, mobile eyes in a
mobile head on a mobile body. A hierarchy of levels of activity constitutes the
visual perceptual system, with each level of activity, from lesser to greater,
permitting increasingly more complex transformations of the optic array and,
concomitantly, the detection of increasingly higher-order invariants. This
conception, of course, does not deny that an eye is for light, but it does
question the assumption that seeing is entirely an act of the brain and can be
understood solely as such. Seeing, Gibson would say, is an act of the organism.

Consider if you will what transpires when an observer is confronted with an
interesting but imperfectly clear object at some distance from her. What does
she do to increase the clarity of her perception? She could, of course, adjust
the lenses of the eyes, but an eminently more significant improvement would
follow upon her walking up to the object. The moral is simply: locomotion is
not immaterial to vision and an eye is all the better for having legs.

It is instructive in this context to consider what is meant by visual
information processing. From the traditional interpretation of vision as a sense
system, the meaning is clear: there are signals emanating from the environment
yielding neural signals to be interpreted. But for Gibson, processing cannot
mean this. Quite to the contrary, processing resides in the activities of the
perceptual system, in its adjusting, orienting, exploring, and optimizing of
information about the environment that is external to the perceiver (Gibson,
1973). On this view, processing is not in the brain as such but pervades the
continual cybernetic loop of afference-efference-reafference that defines the
synergistic relation between the perceiver and his ecology.

Equivalence of Perceptual Systems

Let us now return to the characteristics of the optic array in order that
we may touch upon a further aspect of the concept of invariance with respect to
the notion of a perceptual system. We commented above that the optic array as
the natural stimulation for the visual perceptual system consisted of order and
change of order. What we now recognize is that these characteristics are not
modality specific and may be taken as intrinsic to stimulation for any
perceptual system. This being the case, one may conjecture that the same
structure, and thus the same environmental event, can be equally available to
more than one perceptual system. For example, the pattern of discontinuity
formed by rubbing or scraping the skin with an object might have the same
abstract description as the pattern of acoustic discontinuity in the air
determined by rubbing and scraping the same object against other objects.

Experiments into seeing with the skin speak to this very issue (White,
Saunders, Scadden, Bach-y-Rita, and Collins, 1970). The "observer" in these
experiments is attached to a system that uses a television camera as its eye,
where the camera is quite mobile and can be aimed variously by the observer.
The video image is electronically transferred and delivered to a 20 x 20 matrix
of solenoid vibrators that stimulate an approximately 10-in.-square area on the
observer's back. Each solenoid vibrates when its locus is within an illuminated
region of the camera field; thus the matrix as a whole yields a pattern of
discontinuity on the skin that corresponds essentially to the arrangement of
intensity transitions in the optic array.
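
The mapping from camera field to vibrator matrix can be sketched as follows.
This is an illustration of the rule stated above (a vibrator is on when its
locus falls within an illuminated region) and not a description of the actual
apparatus of White et al.: the block averaging, the brightness threshold of
0.5, the image representation, and the name `tactile_pattern` are all
assumptions made for the example.

```python
def tactile_pattern(image, rows=20, cols=20, threshold=0.5):
    """Reduce a brightness image to a rows x cols grid of booleans.

    `image` is a list of equal-length rows of brightness values in 0..1.
    True means the vibrator at that locus is on. The averaging and the
    threshold are illustrative assumptions.
    """
    height, width = len(image), len(image[0])
    if height < rows or width < cols:
        raise ValueError("image must be at least as large as the matrix")
    pattern = []
    for i in range(rows):
        r0, r1 = i * height // rows, (i + 1) * height // rows
        row = []
        for j in range(cols):
            c0, c1 = j * width // cols, (j + 1) * width // cols
            # Mean brightness of the image region assigned to this vibrator.
            block = [image[r][c] for r in range(r0, r1)
                     for c in range(c0, c1)]
            row.append(sum(block) / len(block) >= threshold)
        pattern.append(row)
    return pattern


# A camera field whose left half is illuminated and whose right half is dark.
field = [[1.0] * 20 + [0.0] * 20 for _ in range(40)]
pattern = tactile_pattern(field)
print("".join("#" if on else "." for on in pattern[0]))
```

The edge between the illuminated and dark halves of the field reappears as an
edge between active and inactive vibrators, which is the sense in which the
tactile pattern preserves the arrangement of intensity transitions.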




Of the many intriguing results reported, the following are the most
significant to our concerns. First, properties of the environment are specified
in the tactile array. Second, there is no qualitative difference between blind
observers and blindfolded normal seeing observers; nor are there any
quantitative differences of any great significance. Both blind and sighted (but
blindfolded for testing) observers learn to detect the environmental properties
specified in the tactile array with virtually equal facility. Third, the tactile
array specifies the layout of surfaces and the whereabouts and identity of
objects significantly better when it is changing than when it is static. In
brief, the observer's ability to detect environmental facts is enhanced when he
can move the camera, that is, when he can transform the tactile array. As with
the optic array, variation in the tactile array reveals nonvariation; it permits
the isolation of what is essential from what is not.

That a changing tactile array can yield information about the
three-dimensional structure of an environment (more precisely, the surface
layout) and that its pickup does not require prodigious training (see White et
al., 1970) is most favorable to Gibson's position and embarrassing to the
traditional one. The fact that the tactile array at a moving point of
observation contains the same adjacent and successive order (the same abstract
mathematical structure) as is contained in the transforming optic array means
that the same invariants are there to be detected. On this account, the tactile
stimulation does not have to be cross-correlated with information detected by
other perceptual systems, or compared with memorial information, in order to be
rationalized; to the contrary, the tactile arrangement affords the same meaning
as the optical arrangement.

At this juncture it is appropriate to draw a further contrast between the
constructivist perspective, in particular the information-processing approach,
and the Gibsonian. The emphasis on transforming arrays in the preceding
discussion implies that perception is easier, in some sense of the word, when
the input variation is greater. But common to a great deal of
information-processing modeling is the principle that the nervous system is
severely limited in its information-handling capabilities; the implicit
assumption is that the fewer the sensory events to be handled per unit of time,
the more proficient the nervous system is as an information-processing device.
The contrast between the two perspectives, the
information-processing/constructivist and the Gibsonian, is highlighted in the
following comments by White et al. (1970:27):

Visual perception thrives when it is flooded with information, when
there is a whole page of prose before the eye, or a whole image of
the environment; it falters when the input is diminished, when it is
forced to read one word at a time, or when it must look at the world
through a mailing tube.... The perceptual systems of living organisms
are the most remarkable information-reduction machines known.
They are not seriously embarrassed in situations where an enormous
proportion of the input must be filtered out or ignored, but they are
invariably handicapped when the input is dramatically curtailed or
artificially encoded. Some of the controversy about the necessity of
preprocessing sensory information stems from disappointment in the
rates at which human beings can cope with discrete sensory events.
It is possible that such evidence of overload reflects more of an
inappropriate display than a limitation of the perceiver.




Let us consider one further example of the equivalence of perceptual
systems. Consider an object on a collision course with an observer. Optical
concomitants of this event consist, in part, of a symmetrically expanding radial
flow field kinetically defined over the texture bounded by the object's contours
and, simultaneously, the occlusion and disocclusion of texture at the contour
edges. If this mathematically complex change or some reasonable facsimile of it
is simulated on a screen, the observer will involuntarily duck or dodge (Schiff,
Caviness, and Gibson, 1962; Schiff, 1965). And he will do so whether human
(adult or infant) or animal (e.g., monkey, chicken, crab).

For simplicity, we may say that the acceleration in symmetrical expansion,
referred to as "looming" (Schiff et al., 1962), specifies impending collision.
Now it is the case that the participants in the seeing-with-the-skin experiments
would often give startled ducks of the head when the part of the tactile array
corresponding to an object was suddenly magnified by a quick turn of the zoom
lever on the camera (White et al., 1970). One anticipates the Gibsonian
interpretation of this curious fact: the mathematical description of the
changing optic array and of the changing tactile array corresponding to looming
are identical. That is to say that even though there are obvious qualitative
differences between visual and tactual experience, both perceptual systems
detect the same information specifying the event of looming.
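
Why the expansion accelerates can be seen in a small numerical sketch. It is
offered as an illustration under simplifying assumptions, not as the analysis
of Schiff et al.: the object is treated as a disk of fixed size approaching
head-on at constant speed, and only its overall visual angle is tracked. The
size, distance, speed, and time step are arbitrary, and the function names are
hypothetical.

```python
import math


def visual_angle(size, distance):
    """Visual angle (radians) subtended by an object of `size` at `distance`."""
    return 2.0 * math.atan(size / (2.0 * distance))


def looming_profile(size=0.5, start_distance=10.0, speed=5.0, dt=0.1,
                    steps=15):
    """Visual angle at successive instants for a constant-speed approach.

    All default values are illustrative assumptions. Sampling stops before
    the object reaches the point of observation.
    """
    angles = []
    for k in range(steps):
        distance = start_distance - speed * k * dt
        if distance <= 0:
            break
        angles.append(visual_angle(size, distance))
    return angles


angles = looming_profile()
rates = [(b - a) / 0.1 for a, b in zip(angles, angles[1:])]
print([round(r, 4) for r in rates])
```

Even though the approach speed is constant in this sketch, each successive
rate of expansion exceeds the one before it. The same profile would describe
any array, optical or tactile, in which the region corresponding to the object
grows in this fashion.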

And consider now the acoustic array as structured by a car on a collision
course with you. Again we will have to entertain the hypothesis that the
mathematical description of this looming is the same as that for vision and
touch. In sum, perceptual systems as different modes of attention sample the
same world, and when the same event is detected, it is because of an abstract
identity in the available structure to which each system is sensitive. Looming
as an environmental event structures the optic array, the tactile array, and the
acoustic array in the same fashion.

But what then becomes of the notion of sensation in this scheme of things?
For Gibson (e.g., 1966), sensation is largely divorced from perception, in
contrast to the more traditional orientation that classifies sensation as the
necessary precursor to perceptual experience. Gibson's stance on this point is
very similar to that of Heider (1959), who distinguishes between thing
(message?) and medium. The human perceiver may attend to either. If he attends
to the way in which the medium is structured by the environment, then he
perceives; if, on the other hand, he attends to the medium itself, then he is
said to have sensations. Given our current understanding of infant vision (cf.
Bower, 1971), which suggests that infants see objects in definite ways and at
determinate distances, we might be led to conjecture that attending to the way
in which light is structured by the environment (that is, attending to the
message) is ontogenetically prior to the ability to attend to the medium itself.
On this conjecture, sensitivity to the medium and interpreting the medium
reflect a latter-day sophistication, whereas at the outset the visual perceptual
system is geared to detecting "messages."

The reader may be of the opinion that I am belaboring a somewhat arbitrary
and even meaningless distinction. Or, more simply, perhaps a discussion of the
sensation/perception distinction seems out of place in a treatment of current
thought on visual perception. For myself, however, I am convinced that much of
current theory acknowledges implicitly the view that sensation is prior to
perception. As hinted earlier, constructivism has departed little from the
structuralist notion of discrete sensations as the sum of what is directly
accessible. Fundamentally, all that has been done is to substitute the term
"feature" for "sensation." And whether antiquated sensations or modern features
are taken as directly given (with all else as inference or construction), Ryle's
(1949) prisoner-in-the-cell parody is still fashionable (cf. Turvey, 1974). As
the parody goes, a prisoner has been held in solitary confinement since birth.
His cell is devoid of windows but cracks in the walls provide occasional
flickers of light, and through the stones occasional scratchings and tappings
can be heard. On the basis of these snippets of light and sound, our hapless
prisoner apprehends the unobserved happenings and scenes outside his cell such
as football games, the Miss America Pageant, and audiences at the World Congress
on Dyslexia. But we should ask, as Ryle does, how could our prisoner ever come
to know anything about, say, football games, except by having perceived one in
the first place?

Information about One's Self

We recall the conclusion drawn about the flowing optic array at a moving 
point of observation: the change is propriospecific, in contrast to the
nonchange, which is exterospecific. Having deliberated on the contrast between
perceptual systems and information about the environment, on the one hand, and
sense systems and sensations on the other, we are now in a better position to
appreciate the implications of the egospecific nature of vision. 

On the classical Sherringtonian view (Sherrington, 1906), each sense or
receptor system performs a unique function — it is said to be exteroceptive,
proprioceptive, or interoceptive. This view was buttressed by an older and more
sacred doctrine, namely, Johannes Müller's law of specific nerve energies, which
holds that sensation is specific to the receptor initiating it. Müller's
doctrine argues that one's awareness is of the state of the nerves; thus, by
implication, an awareness of movements of the self is an awareness of the states
of a specialized receptor system. Tradition referred to this awareness as
proprioception/kinesthesis and localized the specialized receptor system in the muscles and
joints. 

Gibson's rebuttal to these conceptions is by now quite evident. If by
exteroceptive we mean detecting information about events extrinsic to the
organism and if by proprioceptive we mean detecting information about the
animal's own bodily activities, then clearly vision is as capable of the latter
as it is of the former. The classical dichotomy of exteroceptive and
proprioceptive systems is wrong. Thus we may speak meaningfully of "visual
proprioception" in
reference to the visual perception of bodily movement (Gibson, 1966; Lishman and
Lee, 1973; Lee and Aronson, 1974) to emphasize that proprioception is not the 
prerogative of a specialized receptor system or sense organ. Indeed, one can 
make an argument that visual proprioception is the more reliable, and often the 
only reliable, source of information about movements of the self (egomovement).
For example, muscle-joint kinesthesis is uninformative when I am traversing an
environment by car or by train. But the flowing optic array would appear to
specify locomotion of the observer, be he passive or active. And what of the
fish swimming upstream or the bird flying against a head wind? Muscle-joint
kinesthesis would specify movement in the sense of change of location where there
is none.

For a locomoting observer the flow pattern ought to provide information for
getting about in an environment, that is, for guiding or steering one's
movements. The principal feature of the flowing optic array concomitant to
locomotion is that it is a total transformation of the projected environment. The
array expands ahead and contracts behind. The projection of the place in the
environment to which the observer is heading is the focus of expansion in the
ambient optic array, while the focus of contraction is the direction from which
the observer has just come. Thus there is information about the direction of
movement, and to steer toward a particular object means essentially to keep that
object — as a closed internally textured contour in the optic array — as the focus
of optical expansion (Gibson, 1958; Johnston, White, and Cumming, 1973). In a
detailed mathematical analysis of the transforming optic array at a moving point 
of observation, Lee (1974) develops a proof of the availability of body-scaled 
information about the size and shape of potential obstacles to locomotion. 
Relatedly, the claim can be made that there is body-scaled information relevant
to the amount of thrust needed to leap over an obstacle and to the time at which
the thrust is to be applied. In sum, for an animal capable of registering the
optical velocity field and its derivative properties, there is directly
specifiable optical information for the guidance of locomotion in a cluttered
environment.

But do I need to view the focus of expansion in order to know in which
direction I am moving? We know, of course, that we can perceive our direction of
movement when the head is turned sideways. Is it possible therefore that samples
of the total optical flow pattern, samples that do not contain the focus of
expansion, can specify the direction of one's motion? The answer is provided
through the use of computer-generated motion pictures depicting various flow 
patterns for egomotion over an endless textured plain (Warren, 1975). Samples 
of optical flow patterns can be generated that give specific impressions of
movement toward specific locations not contained in the samples. Aside from
highlighting the broad egomovement specificity of the changing optic array, a
particularly significant feature of this demonstration is that a sample of the
total ambient array can specify the total ambient array. Or, to put it somewhat
differently, the flow pattern of the field of view specifies characteristics of
the ambient optic array outside the field of view. On this point, Warren (1975)
reminds us of Merleau-Ponty's (1947:277) claim that: "We see as far as our hold
on things extends, far beyond the zone of clear vision, and even behind us. When 
we reach the limits of the visual field we do not pass from vision to non- 
vision. ..." 
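
To make the geometry of the preceding two paragraphs concrete, the following is a
minimal sketch and not a reconstruction of Warren's (1975) displays. It assumes a
pinhole projection with unit focal length, pure translation of the observer, and
a static environment; the function names `flow_field`, `estimate_foe`, and
`side_window_sample` are hypothetical. Under those assumptions every flow vector
lies on a line through the focus of expansion, so the focus can be recovered from
a sample of the flow that does not contain it.

```python
import numpy as np

def flow_field(points, heading):
    """Image positions and velocities of static points for a translating observer.

    Assumes a pinhole projection with focal length 1 and no observer rotation.
    """
    X, Y, Z = points.T
    Tx, Ty, Tz = heading
    x, y = X / Z, Y / Z                      # projected positions
    u = (x * Tz - Tx) / Z                    # horizontal image velocity
    v = (y * Tz - Ty) / Z                    # vertical image velocity
    return np.column_stack([x, y]), np.column_stack([u, v])

def estimate_foe(positions, velocities):
    """Least-squares intersection of the lines along which the flow vectors lie."""
    x, y = positions.T
    u, v = velocities.T
    # A flow vector (u, v) at (x, y) points along the line through the focus
    # (fx, fy), so v * (x - fx) - u * (y - fy) = 0 for every sampled element.
    A = np.column_stack([v, -u])
    b = v * x - u * y
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return foe

def side_window_sample(n=200, seed=0):
    """Texture elements seen through a window well to the right of straight ahead."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.5, 1.0, n)             # window excludes the image point (0, 0)
    y = rng.uniform(-0.3, 0.3, n)
    Z = rng.uniform(2.0, 10.0, n)            # assorted distances (arbitrary units)
    return np.column_stack([x * Z, y * Z, Z])

positions, velocities = flow_field(side_window_sample(), heading=(0.0, 0.0, 1.0))
print(estimate_foe(positions, velocities))   # close to [0, 0], outside the window
```

The least-squares step is only a convenient way of intersecting the flow lines;
no claim is intended that a visual system proceeds in this manner.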

The proprioceptive role played by the visual perceptual system enjoys
further elaboration in the demonstration that the human infant's ability to
maintain a stable posture is very much under visual control. More generally, it is
believed that the principal information about body sway is broadcast by receptors
in the vestibular canals and in the muscle/joint complexes — mechanoreceptors, as
they are often called. However, it can be shown that balance is perturbed by 
transformations of the ambient optic array in a direction that is specific to the 
transformations (Lee and Aronson, 1974). From the foregoing account, an expand- 
ing optic array specifies forward egomovement, a contracting optic array speci- 
fies backward egomovement. An infant standing on a stationary floor in a room 
can be caused to sway forward or fall forward when the walls and ceiling of the 
room move away from him (i.e., contraction of the optic array) and to sway
backward or fall backward when the "room" moves toward him (expansion of the optic
array). Though not as pronounced, the same relation between standing and optical
change can be observed in adults (see Lee and Aronson, 1974). 



Object-Structure Correlates in the Optic Array and Fourier Analysis

To a very large extent the preceding sections have assumed a "frozen" envi- 
ronment. Our analysis has concentrated on the stationary and the moving point of 
observation in respect to an illuminated environment, which was described 
tersely as a fixed arrangement of textured surfaces. Of course, implicitly it
was assumed that the environment was cluttered with objects — themselves
arrangements of surfaces — but little comment was made about them or about their
activities, with the exception of a passing reference to "looming." In the sections
that follow, this omission is partially remedied as we direct our attention, in 
measured steps, to objects and the changes they undergo. In what follows, we 
unfreeze the environment. 

But first a brief consideration of the information about a nontransforming
object at a stationary point of observation is in order. (The object in mind is
one that is detachable from the ground plane: it is movable.) Expressed roughly:
an object's concomitant in the optic array is a visual solid angle, corresponding
to the exposed face(s) of the object, packed with smaller (and nested) visual
solid angles, corresponding to the facets or grain of the exposed face(s). One
might say for simplicity that the concomitant is a closed contour with internal
texture.

We can raise two elementary but instructive questions. First, what kind of 
ordered discontinuity in the arrangement of ambient light separates the visual 
solid angle corresponding to an object from the other visual solid angles in 
which it is nested? More precisely, what is the mathematical invariant for con- 
tour? Second, what kind of ordered discontinuity in the optic array is specific 
to the type of articulation between surfaces comprising an object? More bluntly, 
what is the mathematical invariant for edge type? 

We are reminded that both of these questions were raised earlier in a
different way and in a different context, that of seeing machines, where the concern
is with line drawings of variously arranged polyhedral bodies and how a three- 
dimensional description of them can be recovered. It is important in that con- 
nection to devise heuristics for separating bodies from bodies and for identify- 
ing the quality of the joint between surfaces. In the current context we are 
assuming a natural environment, as opposed to a line drawing, and ecological 
optics, as opposed to image optics. Our current intuition is that the light to 
an eye should specify unequivocally the presence of contours and the type of 
edge. The following is meant only as a hint of how this intuition might be
substantiated.



Solid objects are composed of textured surfaces whose gradients relative to 
some stationary point of observation are correlates of their shape and slant. 
For example, as the slant of a surface varies in continuous or stepwise fashion, 
then the projected gradient ought to do likewise (Gibson, 1950, 1966). In the 
instance of an object-surface bending or of two surfaces of a polyhedral object 
joining, the abrupt change in slant will yield an abrupt change in the gradients 
of optical texture density. Precisely, a concave bend or joint (edge) is
specified by an abrupt transition from one rate of change in intensity
transitions to a slower rate; a transition to a faster rate specifies a convex
bend or join. These mathematical discontinuities exemplify the form of invariant
for edge type. An invariant for contour can be a simpler mathematical
discontinuity — a sudden transition in optical texture density.
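
The distinction just drawn can be illustrated with a small sketch under
simplifying assumptions that are mine rather than Gibson's or Mace's: optical
texture density is sampled at equal steps along a single line through the array,
a contour is a jump in density larger than `jump_tol`, and an edge is a change in
the rate of change larger than `bend_tol`. The function name and both tolerances
are hypothetical, and the labeling of slower and faster rates as concave and
convex simply follows the rule stated above.

```python
import numpy as np

def classify_discontinuities(density, jump_tol=3.0, bend_tol=0.05):
    """Locate contours (jumps in density) and edges (changes in its rate of change).

    Returns (contours, edges): contours is a list of sample indices at which the
    density jumps; edges is a list of (index, kind) pairs with kind "concave"
    (transition to a slower rate) or "convex" (transition to a faster rate).
    """
    density = np.asarray(density, dtype=float)
    rate = np.diff(density)                      # rate of change between samples
    is_jump = np.abs(rate) > jump_tol            # sudden transition in density
    contours = [int(i) + 1 for i in np.flatnonzero(is_jump)]

    edges = []
    for i in range(len(rate) - 1):
        if is_jump[i] or is_jump[i + 1]:
            continue                             # a jump is a contour, not a bend
        if abs(rate[i + 1] - rate[i]) > bend_tol:
            slower = abs(rate[i + 1]) < abs(rate[i])
            edges.append((i + 1, "concave" if slower else "convex"))
    return contours, edges

# Density rising steeply, then gently: one change in the rate of change.
print(classify_discontinuities([0, 1, 2, 3, 4, 5, 5.2, 5.4, 5.6, 5.8]))
# Density with one sudden step and the same gradient on either side.
print(classify_discontinuities([0, 1, 2, 3, 13, 14, 15]))
```

The point of the sketch is only that both kinds of discontinuity are defined over
differences, and differences of differences, rather than over the density values
themselves.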

The above is but a rough account of the optical specification of edge and 
contour in a frozen array (for a fuller account dealing with transforming arrays,
see Mace, 1974). It is instructive, though, in the following senses: first, it
highlights the principle that the optical structure uniquely specifying
environmental properties is "...defined not over parameters of radiant light rays
such as intensity, but over relations of change in intensity" (Mace, 1974:144);
second, it expresses the mathematically abstract nature of invariants — an edge
type is given by a particular change in rates of change; third, it strongly 
implies that the perception of object relations in a scene is predicated upon 
optical variables and mechanisms quite different from those currently envisioned 
by artificial intelligence research. 

The last point is well worth developing. In most scene analysis programs 
there has to be a method for mapping lines in the picture domain onto edges in
the scene domain. But in the Gibsonian view, lines are not primitive entities
from which all else is constructed. The organism is confronted by mathematical
relations in the light, relations across changes in intensity transitions, which
are specific to edge types. If he can pick up on these higher-order relations,
he has no need for lines nor for line-to-edge mapping rules.

Mackworth (1973) has described a computer program for interpreting line 
drawings of polyhedral scenes that does not rely on knowledge of specific object
prototypes (cf. Falk, 1972) nor does it appeal to the technique of mapping
vertices onto corners (cf. Clowes, 1971). The kernel feature of Mackworth's
program is the mapping of surfaces into a gradient space and the use of this
representation to determine (through the use of "coherence" rules that
articulating surfaces must satisfy) the properties of the scene. Of course,
Mackworth's program is designed for line drawings (i.e., textureless regions)
and therefore requires the use of heuristics to determine the pattern of surface
orientations. But one is motivated by this program to ask how much more
straightforward, computationally, scene analysis programs would be if they
worked on natural scenes instead of line drawings, and were equipped with the
means to determine density and discontinuities of texture. For a long time
Gibson has argued that picture perception is not a simpler version of natural
scene perception. His suspicion is
that, on the contrary, an account of the latter is more readily forthcoming than
an account of the former. 

The tenor of the immediately preceding comments suggests that it would be
useful to grasp, if only in a rough and approximate way, the kind of mechanism
permitting the detection of textural variables. Understandably, little effort
has been directed to the determination of how surface structure is perceived
because most investigators interested in the how of perception are firmly
entrenched in retinal image optics. Consequently, they have been satisfied with
inquiries into two-dimensional outline forms and features. However, research in
computer vision has not been so remiss (see Hawkins, 1970) and one recent report
is especially interesting: Bajcsy (1973) points to the special advantages of
surface texture descriptions derived in the Fourier domain. Why this should be 
thought interesting becomes evident when one considers a rather peculiar version 
of the cross-adaptation experiment described earlier. 

Consider a grating of vertical dark bars of equal width on a light
background such that the plot of luminance against spatial location yields a
regularly repeating square wave. Now consider a further grating in which the
relation between bars and background yields a sinusoidal plot of luminance
against spatial position. Most clearly we could generate an indefinitely large
number of gratings yielding square waves and sinusoids of different spatial
frequencies as measured in cycles per degree of visual angle. The use of
gratings such as these in cross-adaptation (and related) experiments suggests
that the visual system is selectively sensitive to spatial frequency — that it
performs a spatial frequency analysis (Cornsweet, 1970; Sekuler, 1974).

A stronger inference, but one that is not dictated by the data (see Sekuler, 
1974), reads as follows: there is a set of independent channels, each sensitive
to a particular spatial frequency band, whose overall behavior may be
characterized as that of performing an analysis into Fourier components (e.g.,
Pollen, Lee, and Taylor, 1971). By way of illustration, Blakemore and Campbell
(1969) demonstrated that adapting to a high-contrast square wave grating of
frequency F raised the threshold for sinusoidal gratings of frequency F and (to
a lesser degree) sinusoidal gratings of frequency 3F. If the visual system is
performing a Fourier analysis, then the square wave grating ought to be
"decomposed" into its fundamental and odd harmonics. And, consequently, there
ought to be a change in the sensitivity of the system to sinusoidal gratings at
the fundamental and odd harmonic frequencies. In this respect, the Blakemore and
Campbell finding fits the Fourier model. A further fit is a demonstration by
Maffei and Fiorentini (1972) of Fourier synthesis: a perceptual impression of a
square wave grating of frequency F results when two sinusoidal gratings, one of
frequency F and one of frequency 3F, are presented separately, one to one eye and one to the
other. 
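
The prediction at issue rests on a standard fact about Fourier series: a square
wave of frequency F has energy only at F and its odd harmonics, with amplitudes
falling off as 1/k. The sketch below checks this numerically with a discrete
Fourier transform. The sampling density, the choice of eight cycles per window,
and the name `harmonic_amplitudes` are illustrative assumptions; nothing here
models the visual system.

```python
import numpy as np

def harmonic_amplitudes(luminance, cycles, n_harmonics=5):
    """Amplitude of the luminance profile at 1x, 2x, ... the fundamental frequency."""
    spectrum = 2.0 * np.abs(np.fft.rfft(luminance)) / len(luminance)
    return [float(spectrum[k * cycles]) for k in range(1, n_harmonics + 1)]

n_samples, cycles = 4096, 8                       # 8 bar periods across the window
x = (np.arange(n_samples) + 0.5) / n_samples      # half-sample offset avoids sin == 0
square = np.sign(np.sin(2 * np.pi * cycles * x))  # square wave grating, contrast +/-1

for k, amp in enumerate(harmonic_amplitudes(square, cycles), start=1):
    series_value = 4 / (np.pi * k) if k % 2 else 0.0   # Fourier-series amplitude
    print(f"{k}F: measured {amp:.4f}, series value {series_value:.4f}")
```

The 3F component of the square wave carries about one third of the amplitude of
the fundamental, which is at least consistent with the smaller threshold
elevation reported at 3F.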

Though discoveries like the above have aroused considerable interest and 
are proliferating at an enormous rate (see Sekuler, 1974), investigators have
yet to concern themselves seriously with the role played by frequency analysis
in visual perception. But surely Gibson's (1961, 1966) ecological analysis and
Bajcsy's (1973) insight suggest the role: natural environments consist of
surfaces; natural surfaces are textured, in the sense of repeating patterns of
varying spatial frequency; and the detection of surface properties, static and
kinetic, is mandatory for a being that must adapt to its world. 

Event Perception: Structural and Transformational Invariants 

Let us now come to an examination of objects undergoing change. The
examination remains in relation to a stationary point of observation.

Shaw et al. (1974:279) have remarked: "The environment of any organism is 
in dynamic process such that the smallest significant unit of ecological
analysis must be an event rather than a simple stimulus, object, relation, geometric
configuration or any other construct whose essence can be captured in static
terms." "Event" is the term most befitting a change in an arrangement of
surfaces; and rolling, opening, falling, rotating, and aging are examples of events.
Nonchange is but a special case of change; consequently, resting, standing, being
supported, etc., are legitimate events. However, the concept of an event cannot
be meaningfully construed in the absence of a subject; thus events are more
accurately exemplified by: "the ball rolls," "the door opens," "the rock falls,"
and "Bill grows old." The last is particularly instructive because it reminds
us that the events we perceive can be of varying duration — the "slow" event of
Bill's aging contrasts with the "fast" event of the rock's falling.

In concert with the fundamental hypothesis of ecological optics, the light 
is structured by an event in a fashion specific to that event. An event is said 
to modulate the light in a way that specifies the identity of the participant in 
the event and the dynamic component of the event, the form of the change. Con- 
sequently, for event perception it is conjectured that the visual perceptual 
system must detect the structural invariant specifying the structure undergoing 
the change and the transformational invariant specifying the nature of the change 
undergone (Shaw and Wilson, in press). 

Let me proceed to elaborate on these conjectures. A demonstration by Gibson 
and Gibson (1957) provides an appropriate example. The observer viewed a screen
onto which was projected the shadow of one of the following: a regular form (a
solid square), an irregular form (an ameboid shape), a regular pattern (a square
group of dark squares), or an irregular pattern (an ameboid group of ameboid dark
shapes or spots). Each silhouette was semirotated cyclically and the observer
had to report on the degree of change in slant. For both the regular and
irregular silhouettes, the observer's experience was of a rigid constant surface
changing slant and he was able to judge the slant most accurately. The
implication of this simple demonstration is quite paradoxical, as Gibson and Gibson
(1957) realized: a change of form yields a constant form together with a change
of slant. The paradox originates, in part, in the two uses of the term "form."
When we speak of "change of form" we are referring to the abstract geometrical
projected form, or silhouette; when we refer to "constant form," we are speaking
about the rigid substantial form. The traditional way to interpret kinetic depth
effects, such as that just described, is to say that the currently perceived
static and flat projected form is combined with the memories of the preceding
static and flat projected forms to yield, through an act of construction, the
substantial form-in-depth (cf. Wallach and O'Connell, 1953). And it is evident
that this interpretation is motivated by the two assumptions of (1) stimuli as
momentary frozen slices in time and (2) retinal image optics as the proper
departure point for speculation. Quite to the contrary is the Gibsonian
interpretation that follows from ecological optics with its notion of an enduring
ambient optic array: a lawful transformation of the optic array specifies both an
unchanging rigid object and its motion. The emphasis here is on the event's
modulation of the structure of the ambient light.

The implication of this interpretation for the understanding of object-shape 
perception is quite radical and exceedingly difficult to grasp on initial
reading. In any event let me take the liberty of spelling it out, relying on
subsequent discussion for its clarification. The shape of an object is not given by
a set of frozen perspectives but by a unique transformation of the optic array.
Or, synonymously, object-shape perception is not based on the perception of static
forms but on the perception of formless invariants detected over time (cf. Shaw
et al., 1974). This hypothesis denies that shape can be captured by the formal
descriptions of classical geometry; that static forms are unique entities (to the
contrary, a form is not a thing but a variable of a thing); that static forms are
primary sources of data for perception; and that shape is an isolable physical
property of objects. On the latter point we may comment that shape is perhaps
better conceived as a property of events (Shaw et al., 1974; Shaw and Wilson,
in press). Before further examination of this hypothesis let us consider an
example of a transformational invariant. 

The sensibility of the observer to the form of change in a changing optic
array is demonstrated in a singularly elegant fashion by von Fieandt and Gibson
(1959). Consider the contrast between rigid and nonrigid (elastic) motions. An
observer peers at a screen onto which is projected the shadow of an irregular
elastic fishnet. The fishnet is attached to a frame that is manipulable in the
following ways: one end of the frame can be slid in and out, compressing and
stretching the shadow (an elastic transform), or the whole frame can be
semirotated back and forth, subjecting the shadow to foreshortening and its inverse
(a rigid translation). All the observer witnesses on the screen is the changing
textured shadow; that is, he does not see the shadow of the frame, only the
changing motions of the textural elements, which are virtually quantitatively
identical for the two transformations. Yet the observer readily perceives a
distinct elastic (topological) transformation, in the one case, and a distinct
rigid transformation, in the other. In short, he is sensitive to the quality of
change, i.e., to the transformational invariants in the ambient optic array that
specify elastic and rigid happenings.

Now the explicit and implicit strands of the currently developing thesis
can be woven together. At a stationary point of observation, a detached object,
as a surface arrangement, has specifiable structural concomitants in the ambient
optic array. If the object is caused to move, or if it is squashed, or if it
grows, or if it disintegrates, the ambient optic array will be transformed; that
is to say, its structure will be disturbed. The onset and offset of this
transformation or disturbance will be identical to the beginning and termination of
the event and since the duration of ecologically significant events ranges from
milliseconds to scores of years, so it is with the event-related optical
disturbance. Though the form of the disturbance does not copy the event, it does
correspond mathematically to the event and is said to do so in two ways: it
corresponds to the kind of change (transformational invariant) and to the
structure whose identity persists in the course of change (structural invariant).
Visual perceptual systems are assumed to sense (detect) these invariants. 

Objects that participate in events are said to have shape, but it is argued 
here that shape is more accurately a property of the events into which the ob- 
jects enter as participants than of the objects themselves. Shape, we have said, 
is a formless invariant over time and to this proposition we now turn. 

Symmetry Groups and Shape Perception

Intuitively a circle and square both have symmetry, but they do not have 
the same symmetry. How might we describe the idea of symmetry so that we can
capture this difference? The mathematician's answer is that symmetry is
concerned with certain rigid mappings (referred to as symmetry operations or
automorphisms) that leave sets of points unchanged. Thus mathematically any rotation
in the plane about its center leaves the circle, as a set of points, invariant.
On the other hand, only certain rotations (i.e., the integer multiples of π/2)
map the square to itself. Evidently the properties of circle and square could be
revealed and contrasted in the symmetry operations that leave them unchanged.
Moreover, we can recognize the synonymity of the concepts of symmetry and
invariance: paraphrasing Weyl (1952), a thing is symmetrical if there is anything we
can do to it so that after we have done it, it appears the same as it did before
(cf. Feynman, 1967; Shaw et al., 1974). For the mathematician, those symmetry
operations that leave an object unchanged form a group, and this group describes
exactly the symmetry (invariance) possessed by the object. Though we cannot draw
an exact parallel between the mathematician's notion of symmetry and Gibson's
notion of invariant, we can use the mathematician's insights as a guideline.
Ideally, the reader will find this particular guideline useful for understanding 
the ideas expressed in the last section. 

To illustrate, consider the symmetry of a square. There are eight
operations that leave the square invariant (i.e., map the square into itself): the
null operation (no change); the clockwise rotations through π/2, π, and 3π/2; and
the reflections in the horizontal axis, vertical axis, and the two diagonal axes.
Now the composition of any two symmetry operations is itself a symmetry operation;
thus reflection in the horizontal axis followed by the clockwise rotation by π is
the same as reflecting in the vertical axis. Thus a table can be constructed of the
composition of each symmetry operation with each other symmetry operation. This 
table, or more accurately the mathematical structure of this table, describes the 
symmetry of the square, and it has certain interesting properties possessed by 
all symmetry tables. 

If $a$ and $b$ are symmetry operations, then $a \circ b$ is a symmetry
operation, where $\circ$ is some principle of combination. This is
the property of closure.

If $a$, $b$, and $c$ are symmetry operations, then $(a \circ b) \circ c =
a \circ (b \circ c)$. This is the property of associativity.

There is a symmetry operation $e$ such that $a \circ e = e \circ a = a$.
Here $e$ is the identity symmetry operation.

Given any symmetry operation $a$, we can always find a symmetry
operation $b$ such that $a \circ b = e$. For every operation there is
an inverse.

These four properties define what is referred to mathematically as a
group.
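
As a check on the example of the square, the following sketch enumerates its
eight symmetry operations as 2 x 2 integer matrices acting on points (x, y),
builds the composition table, and tests the four group properties. The operation
names (`e`, `r90`, `h`, and so on), the matrix representation, and the convention
that `compose(first, second)` applies `first` and then `second` are my own
choices for illustration, not notation from the sources cited.

```python
from itertools import product

# Each operation is a 2x2 matrix acting on column vectors (x, y).
OPS = {
    "e":    ((1, 0), (0, 1)),     # null operation
    "r90":  ((0, 1), (-1, 0)),    # clockwise rotation through pi/2
    "r180": ((-1, 0), (0, -1)),   # rotation through pi
    "r270": ((0, -1), (1, 0)),    # clockwise rotation through 3*pi/2
    "h":    ((1, 0), (0, -1)),    # reflection in the horizontal axis
    "v":    ((-1, 0), (0, 1)),    # reflection in the vertical axis
    "d1":   ((0, 1), (1, 0)),     # reflection in the diagonal y = x
    "d2":   ((0, -1), (-1, 0)),   # reflection in the diagonal y = -x
}
NAME_OF = {matrix: name for name, matrix in OPS.items()}

def matmul(p, q):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def compose(first, second):
    """Name of the single operation equal to applying `first`, then `second`."""
    return NAME_OF[matmul(OPS[second], OPS[first])]

# The composition table; a KeyError here would mean closure had failed.
TABLE = {(a, b): compose(a, b) for a, b in product(OPS, repeat=2)}

closure = all(result in OPS for result in TABLE.values())
associativity = all(
    compose(compose(a, b), c) == compose(a, compose(b, c))
    for a, b, c in product(OPS, repeat=3)
)
identity = all(TABLE["e", a] == a == TABLE[a, "e"] for a in OPS)
inverses = all(any(TABLE[a, b] == "e" for b in OPS) for a in OPS)

print(closure, associativity, identity, inverses)
print(compose("h", "r180"))   # "v": reflection in the vertical axis
```

The table also shows that the order of combination matters in general: a quarter
turn followed by a reflection is not the same operation as the reflection
followed by the quarter turn.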

We can now return to our hypothesis on shape perception, remarking after 
Shaw et al. (1974) that rigid shape is the structural invariant of an event whose
transformational invariant is formally equivalent to a symmetry group, that of
rotations. A partial substantiation of this notion is given in a demonstration
in which a wire cube is rotated at constant speed on each of its axes of symmetry
and strobed at appropriate rates (Shaw et al., 1974). When rotated on a face,
self-congruence is achieved every 90°; when rotated on a vertex, positions of
self-congruence occur every 120°. It follows that the group description of the
cube's symmetry will differ as a function of the axis of rotation. 

For rotation on a face, the period of symmetry is four; for rotation on a 
vertex, the symmetry period is three. In the demonstration, the strobe rate is 
either synchronous or asynchronous with the period of symmetry (that is, the
strobe rate is either an integer multiple or not an integer multiple of the
symmetry period). When the strobe rate is synchronized with the symmetry period,
a cube is seen; when the strobe rate is not synchronized with the symmetry
period, the shape of the object is no longer recognizable as "cube." And
significantly a strobe rate that permits the perception of cube shape when the cube
is rotating on a face does not permit that perception when rotation is on a
vertex, and vice versa.
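
One way to read the synchrony condition is sketched below. The reading is an
assumption on my part, since the account given here is not specific on the
point: the cube turns at a constant rate, it returns to self-congruence
`symmetry_period` times per revolution, and a strobe counts as synchronous when
the number of flashes per revolution is a whole multiple of that number. The
function names are hypothetical.

```python
from fractions import Fraction

def is_synchronous(flashes_per_rev, symmetry_period):
    """True if the flash count per revolution is an integer multiple of the period."""
    return Fraction(flashes_per_rev) % symmetry_period == 0

def strobed_phases(flashes_per_rev, symmetry_period, n_flashes=12):
    """Position of each strobed view within one symmetry sector (0 = self-congruent)."""
    step = Fraction(symmetry_period) / Fraction(flashes_per_rev)  # sectors per flash
    return [(k * step) % 1 for k in range(n_flashes)]

FACE, VERTEX = 4, 3   # self-congruence every 90 and every 120 degrees

for flashes in (4, 3):
    print(flashes, is_synchronous(flashes, FACE), is_synchronous(flashes, VERTEX))
```

On this reading four flashes per revolution are synchronous with rotation on a
face but not on a vertex, and three flashes per revolution show the reverse, in
line with the result described. A common multiple such as twelve flashes per
revolution would count as synchronous for both axes; whether such rates were
tested is not stated here.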

This demonstration is significant in the following respects. First, it is
evident that shape is not specified by any arbitrarily chosen set of variant
perspectives. On the contrary, it appears that rotational symmetry must be
preserved in the variant perspectives projected to the eye. Asynchronous strobing
annihilates this symmetry and results in a set of ordered perspectives that
specify some other shape. In short, different successive orderings of
perspectives specify different shapes of the same physical object and we can now
understand what it means to say that shape is a property of events rather than of
objects. Second, it is clear that not all perspectives are needed to specify
shape, only a special ordered subset of those perspectives. There is reason to
believe that the "special ordered subset" may be well-defined mathematically;
Shaw and Wilson (in press) conjecture that it is the generator set of the group.
All in all, there is some justification for Gibson's (1950, 1966) claim that
shape perception is based on the perception of formless invariants detected over
time.

We can further understand the conception of formless invariants by
considering an event which, quite unlike the rotating cube, can be said to transpire
slowly.

Of some considerable significance to our everyday living is the ability to
determine the relative age-level of faces. All things considered, we manifest
this ability rather well. In the light of our current discussion of events we
might venture to say that this ability reflects our sensitivity to a particular
class of symmetry operations — those that characterize aging. For most assuredly
when we speak of the face aging we are speaking of a remodeling of the head that 
leaves invariant the structural information specifying the species and the more 
specialized structural information affording recognition of the individual. 

The biologist D'Arcy Thompson (1917) had the insight that transformations of
a system of spatial coordinates permit the characterization of the remodeling of
plants and animals by evolution. The particular advantage of the method of
coordinate transformation lies in the fact that one can uncover the appropriate
symmetry operations in the absence of a complete mathematical description of the
object to be transformed. Pittenger and Shaw (in press) applied this insight to
aging. They inscribed a profile in a space of two coordinates and then subjected
the coordinate space to the affine transformations of strain (a transformation
that maps a square into a rectangle) and shear (a transformation that maps a
square into a rhomboid). These symmetry operations — for with appropriately
chosen parameters, they preserve the identity of the individual and the species —
suitably described the transformational invariant of aging: the original profile 
could be mapped into younger and older versions whose relative age levels could 
be accurately ranked by observers. Of the two transformations, strain was the 
more significant. 
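
The two affine transformations named above can be illustrated with a minimal
sketch. It follows only the verbal definitions given in the text (strain maps a
square into a rectangle, shear maps a square into a rhomboid). The function
names, the parameters kx, ky, and k, and the numerical values are assumptions
chosen for illustration; they are not the formulation or the parameter values
used by Pittenger and Shaw.

```python
# Illustrative sketch only: names and parameter values are hypothetical.

def strain(points, kx=1.0, ky=1.0):
    """Stretch x by kx and y by ky; a square becomes a rectangle."""
    return [(kx * x, ky * y) for x, y in points]


def shear(points, k=0.0):
    """Slide each point horizontally in proportion to its height;
    a square becomes a rhomboid (parallelogram)."""
    return [(x + k * y, y) for x, y in points]


# Corners of a unit square standing in for the coordinate space
# in which a profile would be inscribed.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

rectangle = strain(square, kx=1.0, ky=1.5)
rhomboid = shear(square, k=0.5)

print(rectangle)  # [(0.0, 0.0), (1.0, 0.0), (1.0, 1.5), (0.0, 1.5)]
print(rhomboid)   # [(0.0, 0.0), (1.0, 0.0), (1.5, 1.0), (0.5, 1.0)]
```

Because both functions act on the coordinates rather than on any description of
the inscribed shape, the same calls could be applied to the outline points of a
face profile or of any other object, which is the advantage of the coordinate
method noted above.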

The argument emerges that the perception of aging is not necessarily based
on the discrimination of local features but rather on the perception of global
invariant information of a higher order that is detected over time. Indicative
of the higher-order nature or abstractness of this information is the fact that
when the strain transformation is applied to the profile of an inanimate object,
such as a Volkswagen "Beetle," it generates a family of profiles that can be rank
ordered for age commensurate with the rank ordering of faces (Shaw and Pittenger,
in press). The implication is that the transformational invariant for aging is
independent of the features common to all animate things that grow, just as it is
most obviously independent of the features of any single face, or of the features
common to all faces.

Indirect and Direct Perception: A Brief Comparison and Summary 

The purpose of this paper was to provide an overview of perceptual theory
as it relates to vision, together with a liberal sprinkling of empirical facts.
To this purpose I have dealt at length with what I take to be the two major and
contrasting perspectives: that visual perception is indirect and a derivative of
conception; that visual perception is direct and independent of conception. The
focus of this final statement is that the distinction between the two
perspectives turns on the issue of what order of physical space is the proper
basis for the theory of visual perception.

In the main, conjectures on the nature of visual experience derive from a 
framework that takes as its departure point a bidimensional description of the 
environment and that seeks to explain how descriptions in three and four dimen- 
sions are achieved through mental elaboration. A consensus is that these higher- 
order accounts are the result of inference and memory, and it is evident how the 
doctrine of a 2-space world as the sum of what is given dictates an attitude of 
visual perception as constructed. We have remarked earlier on the centuries-old
hegemony of this doctrine (see Pastore, 1971). 

In opposing the official doctrine, the view of perception as primary takes
kinetic events rather than static two-dimensional images as the proper point of
departure (see Gibson, 1966; Johansson, 1974; Shaw et al., 1974). More
precisely, it is argued that the theory of seeing ought to be anchored in
four-dimensional space rather than in the two-dimensional space favored by
tradition. We may suppose that for naturally mobile animals and primitive man
the instances of pure, static perception are rare. Moreover, "The structuring
of light by artifice" (Gibson, 1966:224), the representation of environments
and events by picture, is a relatively recent addition to man's ecology. The
argument, in principle, is that the variety of visual perceptual systems
evolved by nature are incomparably better suited to the transforming optic
array (to the mathematically abstract, optical concomitants of
three-dimensional structures transformed over time) than to the static
two-dimensional pattern.

On this argument there should be several consequences of reducing the
dimensionality of the space in which the visual perceptual system operates.
First, the sum total of environmental properties mapped by the ambient optic
array is reduced. Second, there is a nontrivial change in the class of
invariants: spaces of fewer dimensions are accompanied by invariants of a lower
order. Third, perceptual equivocality relates inversely to order of invariant;
thus, given the second consequence, equivocal perceptions are more likely when
the visual system operates in spaces of fewer dimensions. As one might suspect,
these consequences have implications for learning.

It is quite obvious that in both perspectives one must learn to perceive.
But the kind of learning implied by the traditional perspective differs radically
from that which is implied by the Gibsonian perspective. In the former case, one
must know in advance something about the environment in order to perceive it
properly; thus one must come tacitly to understand a variety of rules and to
register a variety of facts in order to make sense of the inadequate deliverances
of his visual system. The alternative, of course, is that one cannot know
anything about the environment except as he perceives it, or has perceived it.
On the alternative view, we should not treat perceptual learning as a matter of
supplementing inadequate data with information drawn from memory. Rather we
should see perceptual learning as a matter of differentiating the complex, nested
relationships in the dynamically structured medium, that is, of tuning into
invariants.

But given the above, we may conjecture that perceptual learning is a
function of the dimensionality of the physical space in which the visual
perceptual system operates. The frozen array, the limiting case of continuous
nontransformation, provides a less than optimal set of conditions for visual
perception; the invariants are fewer, more difficult to tune into, and less
reliable. Distinguishing bidimensional information poses a more difficult
problem for the visual perceptual system than distinguishing information of
higher dimensions; learning in the former case should be slower and more
devious. On this line of reasoning we may conjecture that reading, the
distinguishing of information in the "letter array," is not a task to which the
visual perceptual system is especially suited, despite its necessary
involvement.

REFERENCES 


Arbib, M. A. (1972) The Metaphorical Brain: An Introduction to Cybernetics as
Artificial Intelligence and Brain Theory. (New York: John Wiley & Sons).

Averbach, E. and A. S. Coriell. (1961) Short-term memory in vision. Bell
Syst. Tech. J. 40, 309-328.

Bajcsy, R. (1973) Computer description of textured surfaces. In Third
International Joint Conference on Artificial Intelligence. (Menlo Park, Calif.:
Stanford Research Institute).

Baron, J. and I. Thurstone. (1973) An analysis of the word superiority effect.
Cog. Psychol. 4, 207-208.

Biederman, I. (1972) Perceiving real-world scenes. Science 177, 77-79.

Biederman, I., A. L. Glass, and E. W. Stacy, Jr. (1973) Searching for objects
in real-world scenes. J. Exp. Psychol. 97, 22-27.

Biederman, I., J. C. Rabinowitz, A. L. Glass, and E. W. Stacy, Jr. (1974) On
the information extracted from a glance at a scene. J. Exp. Psychol. 103,
597-600.

Blakemore, C. and F. W. Campbell. (1969) On the existence of neurons in the
human visual system selectively sensitive to the orientation and size of
retinal images. J. Physiol. (London) 203, 237-260.

Bower, T. G. R. (1970) Reading by eye. In Basic Studies in Reading, ed. by
H. Levin and J. Williams. (New York: Basic Books).

Bower, T. G. R. (1971) The object in the world of the infant. Sci. Amer.
225 (Oct.), 30-38.

Brand, J. (1971) Classification without identification in visual search.
Quart. J. Exp. Psychol. 23, 178-186.

Campbell, F. W. and L. Maffei. (1971) The tilt after-effect: A fresh look.
Vision Res. 11, 833-840.

Clark, S. E. (1969) Retrieval of color information from the preperceptual
storage system. J. Exp. Psychol. 82, 263-266.






Clowes, M. (1971) On seeing things. Artificial Intelligence 2, 79-112.

Coltheart, M. (1971) Visual feature-analyzers and after-effects of tilt and
curvature. Psychol. Rev. 78, 114-121.

Coltheart, M. (1972) Visual information processing. In New Horizons in
Psychology, vol. 2, ed. by P. C. Dodwell. (Harmondsworth, England:
Penguin).

Cooper, L. A. and R. N. Shepard. (1973) Chronometric studies of the rotation
of mental images. In Visual Information Processing, ed. by W. G. Chase.
(New York: Academic Press).

Cornsweet, T. N. (1970) Visual Perception. (New York: Academic Press).

Craik, K. J. W. (1943) The Nature of Explanation. (Cambridge, England:
University of Cambridge Press).

Doost, R. and M. T. Turvey. (1971) Iconic memory and central processing
capacity. Percept. Psychophys. 9, 269-274.

Falk, G. (1972) Interpretation of imperfect line data as a three-dimensional
scene. Artificial Intelligence 3, 101-144.

Favreau, O. E., V. F. Emerson, and M. C. Corballis. (1972) Motion perception:
A color-contingent aftereffect. Science 176, 78-79.

Feynman, R. (1967) The Character of Physical Law. (Cambridge, Mass.: MIT
Press).

Flock, H. R. (1964) Some conditions sufficient for accurate monocular
perceptions of moving surface slants. J. Exp. Psychol. 67, 560-572.

Franks, J. J. and J. D. Bransford. (1971) Abstraction of visual patterns.
J. Exp. Psychol. 90, 65-74.

Gibson, J. J. (1950) The Perception of the Visual World. (Boston: Houghton
Mifflin).

Gibson, J. J. (1958) Visually controlled locomotion and visual orientation in
animals. Brit. J. Psychol. 49, 182-194.

Gibson, J. J. (1959) Perception as a function of stimulation. In Psychology:
A Study of a Science, vol. 1, ed. by S. Koch. (New York: McGraw-Hill).

Gibson, J. J. (1961) Ecological optics. Vision Research 1, 253-262.

Gibson, J. J. (1966) The Senses Considered as Perceptual Systems. (Boston:
Houghton Mifflin).

Gibson, J. J. (1971) The information available in pictures. Leonardo 4,
27-35.

Gibson, J. J. (1972) On the concept of the "Visual Solid Angle" in an optic
array and its history. Unpublished manuscript, Cornell University.

Gibson, J. J. (1973) What is meant by the processing of information?
Unpublished manuscript, Cornell University.

Gibson, J. J. and E. J. Gibson. (1957) Continuous perspective transformations
and the perception of rigid motion. J. Exp. Psychol. 54, 129-138.

Gibson, J. J. and M. Radner. (1937) Adaptation, after-effect, and contrast in
the perception of tilted lines. I. Quantitative studies. J. Exp. Psychol.
20, 453-467.

Goodman, N. (1968) Languages of Art: An Approach to a Theory of Symbols.
(Indianapolis: Bobbs-Merrill).

Gordon, I. E. and S. Hayward. (1973) Second-order isomorphism of internal
representations of familiar faces. Percept. Psychophys. 14, 334-336.

Gregory, R. L. (1969) On how so little information controls so much behavior.
In Towards a Theoretical Biology, vol. 2, ed. by C. H. Waddington.
(Chicago: Aldine Publishing Co.).

Gregory, R. L. (1970) The Intelligent Eye. (New York: McGraw-Hill).

Gregory, R. L. (1972) Seeing as thinking: An active theory of perception.
London Times Literary Supplement, June 23, 707-708.






Guzman, A. (1969) Decomposition of a visual scene into three-dimensional
bodies. In Automatic Interpretation and Classification of Images, ed. by
A. Grasselli. (New York: Academic Press).

Haber, R. N. (1969) Information processing analyses of visual perception: An
introduction. In Information Processing Approaches to Visual Perception,
ed. by R. N. Haber. (New York: Holt, Rinehart & Winston).

Hagen, M. A. (1974) Picture perception: Toward a theoretical model. Psychol.
Bull. 81, 471-497.

Harris, C. S. and A. R. Gibson. (1968) Is orientation-specific color
adaptation in human vision due to edge detectors, afterimages, or "dipoles?"
Science 162, 1506-1507.

Hawkins, J. K. (1970) Textural properties for pattern recognition. In Picture
Processing and Psychopictorics, ed. by B. S. Lipkin. (New York: Academic
Press).

Hebb, D. O. (1949) The Organization of Behavior. (New York: John Wiley &
Sons).

Heider, E. R. and D. C. Oliver. (1972) The structure of the color space in
naming and memory for two languages. Cog. Psychol. 3, 337-354.

Heider, F. (1959) On perception and event structure and the psychological
environment. Psychol. Iss. 1, No. 3.

Held, R. and S. R. Shattuck. (1971) Color and edge-sensitive channels in the
human visual system: Tuning for orientation. Science 174, 314-316.

Helm, C. E. (1964) Multidimensional ratio scaling analysis of perceived color
relations. J. Opt. Soc. Amer. 54, 256-262.

Helmholtz, H. von. (1925) Treatise on Physiological Optics. Translated from
the 3rd German edition (1909-1911) and edited by J. P. Southall. (Rochester,
N. Y.: Optical Society of America).

Henderson, L. A. (1974) A word superiority effect without orthographic
assistance. Quart. J. Exp. Psychol. 26, 301-311.

Hochberg, J. (1968) In the mind's eye. In Contemporary Theory and Research in
Visual Perception, ed. by R. N. Haber. (New York: Holt, Rinehart &
Winston).

Hochberg, J. (1970) Attention, organization, and consciousness. In Attention:
Contemporary Theory and Analysis, ed. by D. I. Mostofsky. (New York:
Appleton-Century-Crofts).

Hochberg, J. (1974) Higher-order stimuli and inter-response coupling in the
perception of the visual world. In Perception: Essays in Honor of James J.
Gibson, ed. by R. B. MacLeod and H. L. Pick, Jr. (Ithaca, N. Y.: Cornell
University Press).

Hubel, D. H. and T. N. Wiesel. (1967) Receptive fields and functional
architecture in two non-striate visual areas (18 and 19) of the cat.
J. Neurophysiol. 30, 1561-1573.

Hubel, D. H. and T. N. Wiesel. (1968) Receptive fields and functional
architecture of monkey striate cortex. J. Physiol. 195, 215-243.

Ingling, N. (1972) Categorization: A mechanism for rapid information
processing. J. Exp. Psychol. 94, 239-243.

Ittelson, W. H. (1960) Visual Space Perception. (New York: Springer-Verlag).

Johansson, G. (1974) Projective transformations as determining visual space
perception. In Perception: Essays in Honor of James J. Gibson, ed. by
R. B. MacLeod and H. L. Pick, Jr. (Ithaca, N. Y.: Cornell University
Press).

Johnston, I. R., G. R. White, and R. W. Cumming. (1973) The role of optical
expansion patterns in locomotor control. Amer. J. Psychol. 86, 311-324.






Johnston, J. C. and J. L. McClelland. (1974) Perception of letters in words:
Seek not and ye shall find. Science 184, 1192-1194.

Jonides, J. and H. Gleitman. (1972) A conceptual category effect in visual
search: O as a letter or a digit. Percept. Psychophys. 12, 457-460.

Kahneman, D. (1968) Method, findings, and theory in studies of visual masking.
Psychol. Bull. 70, 404-426.

Katz, J. J. (1971) The Underlying Reality of Language and Its Philosophical
Import. (New York: Harper & Row).

Katz, L. and D. Wicklund. (1972) Word scanning rate in good and poor readers.
J. Educ. Psychol. 62, 138-140.

Kinsbourne, M. and E. K. Warrington. (1962a) The effect of an aftercoming
random pattern on the perception of brief visual stimuli. Quart. J. Exp.
Psychol. 14, 223-234.

Kinsbourne, M. and E. K. Warrington. (1962b) Further studies on the masking of
brief visual stimuli by a random pattern. Quart. J. Exp. Psychol. 14,
235-245.

Kolers, P. A. (1964) Apparent movement of a Necker cube. Amer. J. Psychol. 77,
220-230.

Kolers, P. A. (1966) An illusion that dissociates motion, object, and meaning.
Quarterly Progress Report (Research Laboratory of Electronics, MIT) 82,
221-223.

Kolers, P. A. (1967) Comments on the session on visual recognition. In Models
for the Perception of Speech and Visual Form, ed. by W. Wathen-Dunn.
(Cambridge, Mass.: MIT Press).

Kolers, P. A. (1968) Some psychological aspects of pattern recognition. In
Recognizing Patterns, ed. by P. A. Kolers and M. Eden. (Cambridge, Mass.:
MIT Press).

Kolers, P. A. (1970) Three stages of reading. In Basic Studies in Reading, ed.
by H. Levin and J. Williams. (New York: Basic Books).

Kolers, P. A. and R. Pomerantz. (1971) Figural change in apparent motion.
J. Exp. Psychol. 87, 99-108.

Kroll, N. E. A., T. Parks, S. R. Parkinson, S. L. Bieber, and A. L. Johnson.
(1970) Short-term memory while shadowing: Recall of visually and of
aurally presented letters. J. Exp. Psychol. 85, 220-224.
Lee, D. N. (1974) Visual information during locomotion. In Perception:
Essays in Honor of James J. Gibson, ed. by R. B. MacLeod and H. L. Pick,
Jr. (Ithaca, N. Y.: Cornell University Press).

Lee, D. N. and E. Aronson. (1974) Visual proprioceptive control of standing in
human infants. Percept. Psychophys. 15, 529-532.

Lishman, J. R. and D. N. Lee. (1973) The autonomy of visual kinaesthesis.
Perception 2, 287-294.

Mace, W. M. (1974) Ecologically stimulating cognitive psychology: Gibsonian
perspectives. In Cognition and the Symbolic Processes, ed. by W. Weimer
and D. S. Palermo. (Hillsdale, N. J.: Lawrence Erlbaum Assoc.).

Mace, W. M. (in press) James Gibson's strategy for perceiving: Ask not what's
inside your head, but what your head's inside of. In Perceiving, Acting, and
Comprehending: Toward an Ecological Psychology, ed. by R. E. Shaw and
J. Bransford. (Hillsdale, N. J.: Lawrence Erlbaum Assoc.).

MacKay, D. M. (1967) Ways of looking at perception. In Models for the
Perception of Speech and Visual Form, ed. by W. Wathen-Dunn. (Cambridge, Mass.:
MIT Press).

Mackworth, A. K. (1973) Interpreting pictures of polyhedral scenes. In Third
International Joint Conference on Artificial Intelligence. (Menlo Park,
Calif.: Stanford Research Institute).



Maffei, L. and A. Fiorentini. (1972) Processes of synthesis in visual
perception. Nature 240, 479-481.

Marcel, A. J. (1974) Perception with and without awareness. Paper presented
at Experimental Psychology Society Meetings, Stirling, Scotland, July.

Marshall, J. D. and F. Newcombe. (1971) Patterns of paralexia. Paper presented
at the International Neuropsychology Symposium, Engelberg, Switzerland.

Mayhew, J. E. W. and S. M. Anstis. (1972) Movement aftereffects contingent on
color, intensity, and pattern. Percept. Psychophys. 12, 77-85.

McCulloch, W. S. (1945) A heterarchy of values determined by the topology of
nervous nets. Bull. Math. Biophys. 1, 89-93.

McCullough, C. (1965) Color adaptation of edge-detectors in the human visual
system. Science 149, 1115-1116.

Merleau-Ponty, M. (1947) Phenomenology of Perception, trans. by C. Smith.
(New York: Humanities Press, 1962).

Minsky, M. (1963) Steps toward artificial intelligence. In Computers and
Thought, ed. by E. A. Feigenbaum and J. Feldman. (New York: McGraw-Hill).

Minsky, M. and S. Papert. (1972) Artificial Intelligence, Memo 252.
(Cambridge, Mass.: MIT Press).

Moray, N. (1967) Where is capacity limited? A survey and a model. Acta
Psychol. 27, 84-92.

Murch, G. M. (1972) Binocular relationships in a size and color orientation
aftereffect. J. Exp. Psychol. 93, 30-34.

Neisser, U. (1967) Cognitive Psychology. (New York: Appleton-Century-Crofts).

Novik, N. (1974) Developmental studies of backward visual masking. Unpublished
Ph.D. thesis, University of Connecticut.

Ogle, K. N. (1962) Perception of distance and of size. In The Eye, vol. 4, ed.
by H. Davson. (New York: Academic Press).

Pastore, N. (1971) Selective History of Theories of Visual Perception:
1650-1950. (New York: Oxford University Press).

Phillips, W. A. (1974) On the distinction between sensory storage and
short-term visual memory. Percept. Psychophys. 16, 283-290.

Phillips, W. A. and A. D. Baddeley. (1971) Reaction time and short-term visual
memory. Psychon. Sci. 22, 73-74.
Pittenger, J. B. and R. E. Shaw. (in press) Aging faces as viscal-elastic
events: Implications for a theory of nonrigid shape perception. J. Exp.
Psychol.: Human Perception and Performance.

Pollen, D. A., J. R. Lee, and J. H. Taylor. (1971) How does the striate cortex
begin the reconstruction of the visual world? Science 173, 74-77.

Posner, M. I. (1966) Components of skilled performance. Science 152,
1712-1718.

Posner, M. I. (1969) Abstraction and the process of recognition. In Psychology
of Learning and Motivation, vol. 3, ed. by G. H. Bower and J. T. Spence.
(New York: Academic Press).

Posner, M. I., S. J. Boies, W. H. Eichelman, and R. L. Taylor. (1969)
Retention of visual and name codes of single letters. J. Exp. Psychol. Monogr.
79, 1-17.

Posner, M. I. and S. W. Keele. (1967) Decay of visual information from a
single letter. Science 158, 137-139.

Posner, M. I. and R. F. Mitchell. (1967) Chronometric analysis of
classification. Psychol. Rev. 74, 392-409.

Reicher, G. M. (1969) Perceptual recognition as a function of meaningfulness
of stimulus material. J. Exp. Psychol. 81, 275-280.

Riggs, L. A. (1973) Curvature as a feature of pattern vision. Science 181,
1070-1072.




Roberts, L. G. (1965) Machine perception of three-dimensional solids. In
Electro-Optical Information Processing, ed. by J. T. Tippett. (Cambridge,
Mass.: MIT Press).

Rock, I., F. Halper, and T. Clayton. (1972) The perception and recognition of
complex figures. Cog. Psychol. 3, 655-673.

Ryle, G. (1949) The Concept of Mind. (London: Hutchinson).

Scharf, B. and L. A. Lefton. (1970) Backward and forward masking as a function
of stimulus and task parameters. J. Exp. Psychol. 84, 331-338.

Schiff, W. (1963) Perception of impending collision: A study of visually
directed avoidance behavior. Psychol. Monogr. 79, Whole No. 604.

Schiff, W., J. A. Caviness, and J. J. Gibson. (1962) Persistent fear responses
in Rhesus monkeys to the optical stimulus of "looming." Science 136,
982-983.

Schiller, P. H. (1965) Monoptic and dichoptic visual masking by patterns and
flashes. J. Exp. Psychol. 69, 193-199.

Sekuler, R. (1974) Spatial vision. In Annual Review of Psychology. (Palo
Alto, Calif.: Annual Reviews, Inc.).

Selfridge, O. G. and U. Neisser. (1960) Pattern recognition by machine. Sci.
Amer. 203 (Aug.), 60-68.

Shaw, R. E. (1971) Cognition, simulation, and the problem of complexity.
J. Structural Learning 2, 31-44.

Shaw, R. E. and M. McIntyre. (1974) Algoristic foundations to cognitive
psychology. In Cognition and the Symbolic Processes, ed. by W. Weimer and
D. Palermo. (Hillsdale, N. J.: Lawrence Erlbaum Assoc.).

Shaw, R. E., M. McIntyre, and W. Mace. (1974) The role of symmetry in event
perception. In Perception: Essays in Honor of James J. Gibson, ed. by
R. B. MacLeod and H. L. Pick, Jr. (Ithaca, N. Y.: Cornell University
Press).

Shaw, R. E. and J. Pittenger. (in press) Perceiving the face of change in
changing faces: Toward an event perception theory of shape. In Perceiving,
Acting, and Comprehending: Toward an Ecological Psychology, ed. by R. Shaw
and J. Bransford. (Hillsdale, N. J.: Lawrence Erlbaum Assoc.).

Shaw, R. E. and B. E. Wilson. (in press) Generative conceptual knowledge:
How we know what we know. In Cognition and Instruction: Tenth Annual
Carnegie-Mellon Symposium on Information Processing, ed. by D. Klahr.
(Hillsdale, N. J.: Lawrence Erlbaum Assoc.).

Shepard, R. N. (1962) The analysis of proximities: Multidimensional scaling
with an unknown distance function. I and II. Psychometrika 27, 125-140
and 214-246.

Shepard, R. N. (in press) Studies of the form, formation, and transformation
of internal representations. In Cognitive Mechanisms, ed. by E. Galanter.
(Washington, D.C.: V. H. Winston & Sons).

Shepard, R. N. and S. Chipman. (1970) Second-order isomorphism of internal
representation: Shapes of states. Cog. Psychol. 1, 1-17.

Shepard, R. N. and C. Feng. (1972) A chronometric study of mental paper
folding. Cog. Psychol. 3, 228-243.

Shepard, R. N. and J. Metzler. (1971) Mental rotation of three-dimensional
objects. Science 171, 701-703.

Sherrington, C. S. (1906) The Integrative Action of the Nervous System.
(Cambridge, England: University of Cambridge Press).

Smith, M. C. and P. H. Schiller. (1966) Forward and backward masking: A
comparison. Canad. J. Psychol. 20, 337-342.

Sperling, G. (1960) The information available in brief visual presentations.
Psychol. Monogr. 74, Whole No. 498.






Sperling, G. (1963) A model for visual memory tasks. Human Factors 5, 19-31.

Sperling, G. (1967) Successive approximations to a model for short-term memory.
Acta Psychol. 27, 285-292.

Sternberg, S. (1966) High-speed scanning in human memory. Science 153,
652-654.

Sternberg, S. (1967) Two operations in character-recognition: Some evidence
from reaction-time measurements. Percept. Psychophys. 2, 45-53.

Sternberg, S. (1969) Memory scanning: Mental processes revealed by reaction
time experiments. Amer. Scient. 57, 421-457.

Stromeyer, C. F. and R. J. Mansfield. (1970) Colored aftereffects produced with
moving images. Percept. Psychophys. 7, 108-114.

Sutherland, N. S. (1973) Intelligent picture processing. Paper presented at the
Conference on the Evolution of the Nervous System and Behavior, Florida
State University, Tallahassee.

Thompson, D. W. (1917) On Growth and Form, 2nd ed. (Cambridge, England:
University of Cambridge Press, 1942).

Treisman, A., R. Russell, and J. Green. (1974) Brief visual storage of shape
and movement. In Attention and Performance, vol. 5. (New York: Academic
Press).

Turvey, M. T. (1972) Some aspects of selective readout from iconic storage.
Haskins Laboratories Status Report on Speech Research SR-29/30, 1-14.

Turvey, M. T. (1973) On peripheral and central processes in vision: Inferences
from an information processing analysis of masking with patterned stimuli.
Psychol. Rev. 80, 1-52.

Turvey, M. T. (1974) Constructive theory, perceptual systems, and tacit
knowledge. In Cognition and the Symbolic Processes, ed. by W. Weimer and
D. Palermo. (Hillsdale, N. J.: Lawrence Erlbaum Assoc.).

Turvey, M. T. and S. Kravetz. (1970) Retrieval from iconic memory with shape
as the selection criterion. Percept. Psychophys. 8, 171-172.

Vanthoor, F. L. J. and E. G. J. Eijkman. (1973) Time course of the iconic
memory signal. Acta Psychol. 37, 79-85.

von Fieandt, K. and J. J. Gibson. (1959) The sensitivity of the eye to two
kinds of continuous transformation of a shadow-pattern. J. Exp. Psychol.
57, 344-347.

von Wright, J. M. (1968) Selection in immediate memory. Quart. J. Exp.
Psychol. 20, 62-68.

von Wright, J. M. (1970) On selection in visual immediate memory. Acta
Psychol. 33, 280-292.

Walker, J. T. (1972) A texture-contingent visual motion aftereffect. Psychon.
Sci. 28, 333-335.

Wallach, H. and D. N. O'Connell. (1953) The kinetic depth effect. J. Exp.
Psychol. 45, 205-207.

Warren, R. (1975) The perception of egomotion. Unpublished Ph.D. thesis,
Cornell University.

Weisstein, N. (1969) What the frog's eye tells the human brain: Single cell
analyzers in the human visual system. Psychol. Bull. 72, 157-176.

Weisstein, N. (1970) Neural symbolic activity: A psychophysical measure.
Science 168, 1489-1499.

Weisstein, N. A. (1971) W-shaped and U-shaped functions obtained for monoptic
and dichoptic disk-disk masking. Percept. Psychophys. 9, 275-278.

Weisstein, N. and C. S. Harris. (1974) Visual detection of line segments: An
object-superiority effect. Science 186, 752-754.






Weisstein, N., F. Montalvo, and G. Ozog. (1972) Differential adaptation to
gratings blocked by cubes and gratings blocked by hexagons: A test of the
neural symbolic activity hypothesis. Psychon. Sci. 27, 89-92.

Weyl, H. (1952) Symmetry. (Princeton, N. J.: Princeton University Press).

Wheeler, D. D. (1970) Processes in word recognition. Cog. Psychol. 1, 59-85.

White, B. W., F. A. Saunders, L. Scadden, P. Bach-y-Rita, and C. C. Collins.
(1970) Seeing with the skin. Percept. Psychophys. 7, 23-27.

Wickens, D. D. (1972) Characteristics of word encoding. In Coding Processes
in Human Memory, ed. by A. W. Melton and E. Martin. (Washington, D.C.:
V. H. Winston & Sons).

Winograd, T. (1972) Understanding natural language. Cog. Psychol. 3, 1-191.







The Perception of Speech*



C. J. Darwin



ABSTRACT

The general plan of this paper is first to examine what problems
our ability to perceive speech raises and then to look at some of the
mechanisms proposed as possible solutions to these problems. The
section on the structure of speech outlines the articulatory foundation
for the basic invariance problems in speech perception; the next
section on cues to phonemic categories illustrates the perceptual
correlates of these articulatory constraints; while the section on
auditory grouping and feature extraction describes some of the
mechanisms that may be operating at early stages in auditory and speech
perception. The last section looks briefly at suprasegmental factors,
such as prosody, and weighs the evidence for various models of
speech perception.

INTRODUCTION

One of the most striking phenomena in the perception of speech is the
degree to which our conscious experience follows the semantic intention of the
speaker. Our conscious perceptual world is composed of greetings, warnings,
questions, and statements; while their vehicle, the segments of speech, goes
largely unnoticed and words are subordinated to the framework of the phrase or
sentence. Nor is this "striving after meaning" a mere artifact, confined to
situations in which we want to understand rather than to analyze, since our
ability to analyze speech into its components is itself influenced by
higher-level units. The basic physical dimensions of the stimulus and even,
under normal listening conditions, the segments of speech are inaccessible to us
directly. This is evident in experiments on the perception of prosody and in
experiments in which subjects listen for a particular phonemic segment. The
results of these experiments parallel in an interesting way the perception of
three-dimensional objects in vision.

Lieberman (1965), for example, studied whether the intonation contours
transcribed by linguists trained in a particular transcription system
corresponded with the objective pitch present in the stimulus. He found that
many of the pitch features described by his linguists were not in fact present
but were introduced by the perceived syntactic structure of the sentence. The
pitch the linguists heard was often that which their system told them should
accompany a particular syntactic structure. Somewhat similar results have been
reported by Hadding-Koch and Studdert-Kennedy (1964). They found that subjects'
judgments about whether a short utterance has a rise in pitch at the end or not
were influenced by the pitch contour in the earlier part of the utterance in a
way that is thought to reflect aerodynamic constraints on spoken pitch
(Lieberman, 1967). In a similar vein, Huggins (1972a) found that subjects had
lower thresholds for detecting changes in the duration of segments of speech
from "normal" when they based their judgments on the rhythm or stress of the
utterance than if they tried to listen for duration itself. Even such familiar
dimensions as pitch and duration seem to be relatively inaccessible to conscious
experience. There is perhaps a parallel here with the problems that face a
painter trying to represent a three-dimensional object in two dimensions.
Gombrich (1960) describes how different stylistic schools resort to a repertoire
of visual tricks or schemata to achieve the two-dimensional representation of a
scene, attaining, like Lieberman's linguists, a representation determined as
much by their training as by the object portrayed. In both cases a more
objective representation can be attained by resorting to artificial aids, which
distract attention from the meaning of the elements portrayed. Dürer illustrates
one extreme way of doing this in his woodcut of a draftsman drawing a reclining
nude by viewing her from a fixed point in space through a transparent grid that
corresponds to a similar grid on his canvas (Gombrich, 1960:306). In the same
way Daniel Jones transcribed intonation contours by lifting the needle from a
phonograph record in midsentence and judging the last-heard pitch. In both these
cases some external segmentation has been imposed on the perceptual whole to
remove some of the changes caused by semantic interpretation. Without this aid
the listener can only reconstruct properties of the stimulus from what he knows
about higher-level categories.

This reconstruction also appears to be operating in experiments in which
the subject is asked to listen for a particular speech segment. Following
traditional phonetics, we will use the term "phoneme" to refer to the smallest
segment of speech that distinguishes two words of different meaning. The
intuitive value of this concept outweighs, for the purposes of this paper, the
theoretical problem that it raises. The words "kine" and "Cain" thus share
initial and final phonemes (/k/ and /n/, respectively) but differ in the medial
diphthong. When subjects listening to a list of two-syllable words are asked to
press a key whenever they hear a particular phoneme, they do this more slowly
than when they are asked to detect a whole syllable; this in turn is slower than
detecting a whole word (Savin and Bever, 1970; Foss and Swinney, 1973).
Moreover, the time to detect a given phoneme at the beginning of a syllable is
shorter if that syllable is a word than if it is not (Rubin, Turvey, and
Van Gelder, in press). Here the subject's performance is influenced by larger
units of analysis despite the needs of the task. If these higher levels are
removed so that phonemes are being targeted in lists of phonemes, and syllables
in lists of syllables, then the smaller units are in fact detected faster
than the larger (McNeill and Lindig, 1973). Our conscious awareness, then, is
driven to the highest level present in the stimulus, allowing lower levels to be
accessible only as a subsequent operation on these higher units.

This somewhat paradoxical finding, that conscious decisions at a higher
level can be made before decisions at a lower level, though striking in speech,
is also encountered in written language. Letters embedded in words are
identified more accurately than letters in nonword strings or in isolation
(Reicher, 1969; Wheeler, 1970) provided, again paradoxically, that subjects do
not know in which position in the word the letter that they have to report will
occur (Johnston and McClelland, 1974). Although both the speech and the written
language results demand an interpretation that stresses the importance of
higher-order units in perception, there are interesting differences between the
two modalities.

First, subjects are quicker at detecting an isolated phoneme than one
embedded in a word (McNeill and Lindig, 1973); in vision isolated letters are
identified less accurately than those in words (Wheeler, 1970). Second, knowing
where in a word a target will appear does not abolish the relative difficulty of
detecting that phoneme, compared with detecting the whole word (Foss and Swinney,
1973; McNeill and Lindig, 1973); in vision, on the other hand, knowing where in
a string the target letter will appear reverses the usual advantage of the
string being a word (Johnston and McClelland, 1974). This latter discrepancy
may perhaps be taken as illustrating the relative ease with which spatially
defined portions of a visual stimulus may be attended to compared with
temporally defined portions of an auditory stimulus. A more attractive
explanation is that these differences arise because of the basically different
nature of spoken and written language. Embedding a phoneme in a syllable, or a
syllable in a word, is, as we shall see, very different from embedding a letter
or string of letters in a typewritten word. The whole physical representation of
the phonemic element can be completely restructured in a way that prevents the
subject from being able to attend to it or detect it without also taking account
of the context in which it occurs.

To appreciate the implications of this restructuring for perception, let us
follow the changes that take place in semantic elements as they are described
linguistically at different levels. The choice of linguistic systems is to some
extent arbitrary, and we will dip both into contemporary generative phonology
and into acoustic and articulatory phonetics.

THE STRUCTURE OF SPEECH 

There is growing psychological evidence that the morpheme is a significant
unit in word recognition. It is more powerful than the word at explaining
frequency effects in perception (Murrell and Morton, 1974), and seems to be able
to explain effects of word length on perception that have previously been
attributed to number of syllables. This evidence is primarily from studies of
the written word but it is probable that similar effects could also be obtained
for speech, particularly as in generative phonology the systematic phonemic
level bears striking similarities to the written word. The systematic phonemic
level represents the linguistic message as an ordered sequence of elements that
preserve morphemic invariance, but that can be mapped, using the particular
rules of a language or dialect, into a phonetic representation. A speech
synthesis-by-rule program could then be used to derive an intelligible utterance
from such
a representation. At the systematic phonemic level the words "courage" and
"courageous" can be represented as /koræge/ and /koræge+ɔs/, forms that bear an
interesting resemblance to the spelled word (Chomsky and Halle, 1968; Chomsky,
1970; Gough, 1972). Quite specific to speech though is the restructuring that
takes place between this systematic phonemic level and the phonetic level to
give strings such as [kʌrəǰ] and [kərēǰəs] (Chomsky and Halle, 1968:235), forms
from which an articulatory specification and thence an acoustic signal could be
derived. This variation between the systematic phonemic level and the phonetic
is one that, though not generally accessible to the naive listener, is
nonetheless readily distinguished by a phonetician. In the example cited above
the change in the second vowel is not difficult to apprehend, but more subtle
changes are covered by the same level of rules (such as the change in aspiration
in the /p/ in "pit" and "spit"), which are difficult to perceive as such except
by the trained ear. This variation has been termed extrinsic allophonic
variation (Wang and Fillmore, 1961; Ladefoged, 1966) as distinct from intrinsic
variation, which is not accessible even to the trained ear of the phonetician.
To appreciate the problems raised by intrinsic allophonic variation we need to
consider in outline the mechanisms by which speech is produced.

In normal (rather than whispered) speech, the main source of sound is the
vibration of the vocal cords. These are set into vibration by airflow from the
lungs, the frequency of the vibration being determined jointly by the stiffness
of the cords and the pressure drop across them (see MacNeilage and Ladefoged, in
press). The waveform of the sound at this stage is roughly an asymmetrical
triangle whose spectrum, for a continuously held pitch, is a series of harmonics
of the fundamental, decreasing in amplitude with increasing frequency. The
effect of the cavities of the mouth (and of the nose when they are coupled in,
as during nasal consonants and nasalized vowels) is to change this spectrum so
that well-defined broad peaks occur in it, corresponding to the resonant
frequencies of the system of cavities. These broad peaks are formants and their
frequencies and amplitudes vary in a well-understood way (Fant, 1960) with the
changing shape of the vocal tract, as the various articulators move to give
formant transitions. The values of the formants are independent of the pitch of
the voice, although the accuracy with which a formant peak can be estimated from
the harmonic structure depends on the particular pitch present. For unvoiced
consonants such as [p,f], or in whispered speech, the vocal cords do not vibrate
but instead the sound source is noise from turbulent air either at the glottis
(for [h] aspiration and whisper) or at some other point of constriction. Peaks
corresponding to the vocal-tract resonant frequencies still exist, but the
spectrum will be continuous rather than composed of harmonic lines. The peaks
found in the noise spectrum will reflect mainly the resonances of cavities in
front of the point of constriction at which the noise is generated.
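
The relation between pitch and formant estimation described above can be made
concrete with a small numerical sketch. It is only an illustration under stated
assumptions, not a model taken from the sources cited: the source roll-off of
-12 dB per octave, the second-order resonance formula, and the formant
frequencies and bandwidths in VOWEL are all hypothetical choices. The sketch
builds the harmonic spectrum of a voiced sound at two pitches and reports the
strongest harmonic near the first formant.

```python
import math

def source_level_db(freq_hz, f0_hz, tilt_db_per_octave=-12.0):
    """Level of a source harmonic relative to the fundamental.

    The -12 dB/octave roll-off is an assumed value, not one from the text.
    """
    return tilt_db_per_octave * math.log2(freq_hz / f0_hz)

def resonance_gain_db(freq_hz, formant_hz, bandwidth_hz):
    """Gain of one second-order resonance (0 dB at very low frequencies)."""
    numerator = formant_hz ** 2
    denominator = math.hypot(formant_hz ** 2 - freq_hz ** 2,
                             bandwidth_hz * freq_hz)
    return 20.0 * math.log10(numerator / denominator)

def harmonic_spectrum(f0_hz, formants, max_hz=4000):
    """Return [(harmonic frequency, level in dB)] for a voiced sound."""
    spectrum = []
    k = 1
    while k * f0_hz <= max_hz:
        freq = k * f0_hz
        level = source_level_db(freq, f0_hz)
        for formant_hz, bandwidth_hz in formants:
            level += resonance_gain_db(freq, formant_hz, bandwidth_hz)
        spectrum.append((freq, level))
        k += 1
    return spectrum

def strongest_harmonic(spectrum, low_hz, high_hz):
    """Frequency of the strongest harmonic inside a search band."""
    in_band = [(level, freq) for freq, level in spectrum
               if low_hz <= freq <= high_hz]
    return max(in_band)[1]

# Hypothetical vowel: (formant frequency, bandwidth) pairs in Hz.
VOWEL = [(500, 60), (1500, 90), (2500, 120)]

low_pitch = harmonic_spectrum(100, VOWEL)    # harmonics every 100 Hz
high_pitch = harmonic_spectrum(220, VOWEL)   # harmonics every 220 Hz

print(strongest_harmonic(low_pitch, 300, 700))
print(strongest_harmonic(high_pitch, 300, 700))
```

The formant frequencies are the same in both calls; only the spacing of the
harmonics that sample the resonance changes. At the lower pitch a harmonic
falls on the assumed first formant, whereas at the higher pitch the nearest
harmonics straddle it, so the peak can only be located approximately.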

The acoustic speech signal reflects only indirectly the movements of the
individual articulators of the vocal tract. These articulators can move with a
large degree of independence: the lips move independently of the tongue,
different parts of the tongue have some independence (Ohman, 1967), and whether
the vocal cords vibrate or not depends but little on the movements of the
supralaryngeal tract provided the airflow is not blocked. This ability of
independent movement of the articulators would present little problem if speech
consisted of a sequence of rigidly defined vocal-tract positions, with every
distinctive linguistic unit having a concomitant vocal-tract configuration.
Unfortunately, this is not the case. Only in the production of continuously held
vowels does the vocal tract come close to this ideal, in that a change in the
position of any articulator will modify the quality of the vowel. For
consonants, though, the situation is much more complicated. They are produced by
constricting the vocal tract at a particular place, the place of articulation,
and in a particular manner. The constriction formed by different manners of
articulation can either be complete (as for the stops [b,d,g,p,t,k]), complete
for the oral cavities but with the soft palate lowered to allow airflow through
the nose (as for the nasals [m,n,ŋ]), incomplete but sufficiently close to give
turbulent noise (as in fricatives [f,s,ʃ,v,z,ʒ]), complete closure in the
midline but with space for airflow at the sides (as with the lateral [l]), or so
incomplete as to approximate extreme vowel positions (as in the semivowels
/w,j/). Definition of the place and manner of the articulation is sufficient
together with voicing to define the consonant; provided this articulation is
accomplished, the articulatory mechanisms not involved with this gesture are
free to get on with whatever the forthcoming phonemes require. It is this
property of coarticulation that is one source of the intrinsic variation between
the signal and the phoneme. A well-known extreme example is that in the syllable
/stru/ the lips are free to round in anticipation of the vowel three phonemes
later irrespective of word boundaries (Daniloff and Moll, 1968). Similar and
perceptually significant coarticulations occur with nasality (Ali, Gallagher,
Goldstein, and Daniloff, 1971; Moll and Daniloff, 1971), which are also
unconstrained by word boundaries (Dixit and MacNeilage, 1972).

A related source of variation is that for most of the time the vocal tract
is moving from one target position to another so that information about which
targets are being approached or left is carried by transitional information,
which of its very nature depends both on the immediately adjoining and more
distant targets. For example, the formant transitions produced by the rapid
movement away from a point of constriction into a subsequent vowel can be a
sufficient cue to the location of that point of constriction (Cooper, Delattre,
Liberman, Borst, and Gerstman, 1952).

A further complication comes from the articulation of vowels in normal fast
speech being considerably less differentiated than in the citation of individual
words (Shearme and Holmes, 1962). In addition, in rapid speech, the target
articulation position for a vowel may not be reached before the tongue must move
back toward the target position for the next phoneme. This articulatory
undershoot necessarily implies a corresponding acoustic undershoot so that
instead of reaching and holding a target position the formants go through maxima
and minima short of their target (Lindblom, 1963).
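
The dependence of undershoot on speaking rate can be sketched with a toy
calculation. This is a minimal illustration, assuming that a formant approaches
its target exponentially with a fixed time constant; that assumption, the
40-msec time constant, and the frequencies used are hypothetical and are not
the fitted model of Lindblom (1963).

```python
import math

def formant_reached(start_hz, target_hz, duration_ms, time_constant_ms=40.0):
    """Formant value after moving exponentially from start toward target.

    Assumed first-order approach; the time constant is a hypothetical value.
    """
    remaining = math.exp(-duration_ms / time_constant_ms)
    return target_hz + (start_hz - target_hz) * remaining

def undershoot_hz(start_hz, target_hz, duration_ms, time_constant_ms=40.0):
    """Distance by which the formant falls short of its target."""
    reached = formant_reached(start_hz, target_hz, duration_ms, time_constant_ms)
    return abs(target_hz - reached)

# Hypothetical second formant starting at 1000 Hz with a vowel target of 2000 Hz.
slow = undershoot_hz(1000.0, 2000.0, duration_ms=200.0)
fast = undershoot_hz(1000.0, 2000.0, duration_ms=50.0)

print(round(slow, 1), round(fast, 1))
```

Under these assumptions the shorter vowel leaves the formant much further from
its target than the longer one, which is the qualitative pattern the text
describes for rapid speech.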

These examples illustrate some of the production mechanisms responsible for
"intrinsic" allophonic variation. But they do not exhaust the sources of
variation; for, as Studdert-Kennedy (1974) points out, between-speaker but
within-dialect variation is not covered by either of the categories intrinsic or
extrinsic. The problem here is that different speakers have different-sized
heads, so that the formants produced when a child articulates a particular vowel
are higher in frequency than those produced by an adult. Even within normal
adults there is a variation of about 20 percent (Peterson and Barney, 1952).
Moreover, this is not a simple scaling problem. The values of the individual
formants for different vowels change by different proportions for male and
female speakers in a way that suggests that men have proportionately larger
pharynxes than women (Fant, 1966).
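
As a baseline for what "simple scaling" would mean, the following sketch treats
the vocal tract as a uniform tube closed at one end, whose resonances follow the
standard quarter-wavelength formula F_n = (2n - 1)c / 4L. The uniform-tube
idealization, the speed of sound, and both tract lengths are assumptions made
for illustration; they are not measurements from Peterson and Barney or Fant.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed round value for air in the vocal tract

def tube_formants(tract_length_cm, count=3):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) c / (4 L)."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * tract_length_cm)
            for n in range(1, count + 1)]

adult = tube_formants(17.5)  # hypothetical adult tract length in cm
child = tube_formants(12.0)  # hypothetical child tract length in cm

# Under pure length scaling every formant rises by the same proportion.
ratios = [c / a for c, a in zip(child, adult)]

print(adult)
print([round(f) for f in child])
print([round(r, 3) for r in ratios])
```

In this idealization a shorter tract raises every formant by one common ratio.
The male and female data discussed above depart from that pattern, with
different formants and vowels shifting by different proportions, which is why
the text says the problem is not one of simple scaling.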

CUES TO PHONEMIC CATEGORIES

There is abundant evidence that articulatory mechanisms structure the
acoustic signal in such a way that in general there is no simple relationship
between a phonetic category and those sounds that are sufficient to cue it.
Work on the perception of synthetic speech has made this point elegantly,
enabling us to understand the relationship between phonetic categories and their
cues sufficiently well to produce intelligible speech automatically from a
phonetic symbol input (Liberman, Ingemann, Lisker, Delattre, and Cooper, 1959;
Holmes, Mattingly, and Shearme, 1964; Kuhn, 1973). What is perhaps less well
understood is the degree to which the cues that have been shown to be important
in synthetic speech are also important in natural speech. Subjects can be
variable in the way they react to synthetic speech and yet natural speech
retains its intelligibility under a wide range of distortions. It is possible
that this discrepancy is due to natural speech containing a wider variety of
cues than has been investigated systematically with synthetic speech and also to
different listeners using the particular cues that are used in the synthesis to
different extents. But another factor is undoubtedly that we do not yet
understand sufficiently the changes in the cues to segments that occur with
context. While this has been emphasized for such effects as that of the
neighboring vowels on a consonant (Delattre, Liberman, and Cooper, 1955; Ohman,
1966), other interactions, such as cues to consonants in clusters, changes that
occur with word boundaries, and changes with stress pattern and speaking rate,
have received very little attention in perceptual experiments until quite
recently.

Vowels

Although vowels produced in isolated syllables can be adequately
distinguished by the steady-state values of their first two or three formants,
it is unlikely that vowel perception in running speech can be dealt with in such
a simple way. Normal continuous speech introduces at least two complications:
speaker change and rapid articulation. Using synthetic two-formant patterns,
Ladefoged and Broadbent (1957) showed that varying the range of frequencies used
in a precursor sentence influenced the vowel quality attributed to a fixed
formant pattern at the end of the sentence. Similar effects have been found for
consonants (Fourcin, 1968). A particular formant pattern is thus perceived
relative to some frame that is characteristic of the particular speaker.
Information about this frame can be provided either by a precursor sentence or,
it seems, by the syllable that contains the test vowel itself. Shankweiler,
Strange, and Verbrugge (in press) have shown that a potpourri of vowels produced
by many different speakers is more intelligible if the vowels are flanked by
consonants than if they are spoken in isolation without the consonants. The
dynamic information from the consonant transitions may limit the possible tract
configurations that could have produced the syllable more effectively than a
steady state could, or it may simply allow the formants to be detected more
accurately.



Embedding a vowel between consonants can, at rapid rates of articulation,
prevent the articulators and hence the formant pattern from reaching the target
values (Lindblom, 1963). Nevertheless, the perceptual system appears able to
compensate for this, so that a vowel is perceived whose steady-state formant
values would have been more extreme than those actually reached in the syllable
(Lindblom and Studdert-Kennedy, 1967). This is illustrated in Figure 1.

Stops 

Stop consonants in intervocalic position are characterized by an abrupt and
complete closure (or stopping) of the vocal tract at some point of articulation,
followed by an abrupt release of this closure. With voiceless stops the vocal
folds cease vibrating at closure and do not start again until some time after
release. The period of closure is thus silent. For the voiced stops though, 
the vocal cords continue to vibrate throughout the period of closure (provided 
it is not too long) producing a low-amplitude, low-frequency sound. The abrupt 
changes in amplitude at closure and release are cues to the presence of a stop 
consonant; /slit/ will change to /split/ if a brief period of silence is intro- 
duced between the friction and the vowel (Bastian, Eimas, and Liberman, 1961). 
Other distinctive acoustic events indicating the presence of a stop consonant 
are the burst of energy at release and, for voiced stops, the rapid rise in the 
first formant after release.

When a stop is released, there is a sudden drop in pressure in the mouth 
cavity and a rapid flow of air through the widening constriction. The initial 
drop in pressure gives an impulsive excitation to the mouth cavities that is 
followed by a brief period of frication from the turbulent airflow through the 
constriction. As the constriction widens, the airflow becomes smooth and then 
the only source of excitation is that from the glottis, which for voiced sounds 
will be air pulses, and for voiceless sounds noise produced by turbulent airflow 
at the glottal constriction. The spectrum of the emergent sound is
predominantly determined by the cavities in front of the source of energy.
Formant structure similar to that for voiced sounds appears in the noise
originating from the glottis (aspiration), but to a much lesser extent in the
noise originating at a supraglottal place of articulation (burst and frication),
which reflects predominantly the resonant frequency of the cavity in front of
the place of articulation. The formant pattern changes as the articulators move
away from the point of articulation into the position appropriate for the next
segment.

A voiceless stop can be cued simply by putting a burst of noise at a
suitable frequency a short time in front of a vowel. What place of articulation
is heard depends on the frequency of the noise and on the vowel that follows. It
is possible to make the same burst of noise sound like two different consonants
by placing it before different vowels [e.g., /pi/, /ka/, /pu/ (Liberman,
Delattre, and Cooper, 1952)]. Similarly the significance of the change in the
formant pattern depends crucially on what vowel the formants lead to (Delattre,
Liberman, and Cooper, 1955). The reason for this dependency is partly that the
formants have to lead into the vowel, but it is also because during closure the
articulators are in a position that anticipates the forthcoming vowel. Because
of this coarticulation, the formant pattern at release will vary with the vowel.
A recent paper by Kuhn (1975) points out that the formant transition that
appears to carry the burden of cuing the place of articulation of the stop is
the one that for a particular stop-vowel combination is associated with the
mouth cavity. The curiosity of the burst of noise at 1400 Hz that cues /pi/,
/ka/, and /pu/ is described by Kuhn as follows: "Before /i/ this burst appears
to be interpreted as part of the rise in frequency of the front cavity
resonance as it moves up to F3. Before /a/, the burst appears to be interpreted
as part of the fall in frequency of the front cavity as it moves to a slightly
lower value in F2. Before /u/ it appears to be interpreted as part of a flat,
lip-release spectrum and was a somewhat worse cue."

STYLIZED SPECTROGRAMS OF SYNTHESIZED SYLLABLES
Figure 1: The upper panel shows stylized spectrograms of synthetic syllables
differing in the maximum values attained by the second and third formants. The
lower panel shows how subjects heard vowels distributed along a continuum
between these maxima. The null condition vowels lack transitions: they have
steady-state formants at the maximum values attained by the "w-w" patterns. As
may be seen from the leftward shift of the /u-i/ boundary in the "w-w" series,
these patterns are heard in the same way as steady-state patterns having formant
values beyond their maxima. This "overshoot" is greater for fast (100 msec) than
for slow (200 msec) patterns (Figures 2 and 3, Studdert-Kennedy and Cooper,
1966; reproduced with permission of publisher and authors).

For intervocalic stops with different vowels on either side (e.g., /idu/)
it has been shown by Ohman (1966) that the formant transitions both into and out
of the stop closure are jointly influenced by both vowels. However, the
perceptual significance of this observation is not entirely clear. Two
independent experiments have failed to find any perceptual correlate of this
coarticulation effect. Fant, Liljencrants, Malac, and Borovickova (1970) using
both natural and synthetic speech could find no general effect of this
coarticulation on intelligibility, and Lehiste and Shockey (1972) found that
listeners could not judge the missing vowel in the vowel-consonant (VC) part of
natural vowel-consonant-vowel (VCV) utterances, even though they claim that
coarticulation effects similar to those observed by Ohman (1966) could be seen
in spectrograms of their stimuli. These two experiments are particularly
interesting since they are a rare example of perception seemingly not being
sensitive to articulatory constraints.

Invariant Cues for Stops?

The issue has been revived recently by Cole and Scott (1974a, 1974b) whether
in natural speech there exist invariant cues or combinations of cues uniquely
specifying particular consonants independent of the succeeding vowel. The claim
made by Cole and Scott is that "place of articulation [for stops] is signaled
by a set of cues which form invariant patterns for /b/, /d/, /p/, /t/, /k/ in
initial position before a vowel in a stressed syllable..." (1974b:359).

This claim is clearly at odds with the results from synthetic speech
showing that neither the burst (Liberman, Delattre, and Cooper, 1952) nor the
formant transitions (Delattre, Liberman, and Cooper, 1955) of adequate synthetic
syllables are invariant with vowel context. One resolution of this difference
is to presume that the synthetic speech studies have missed some important cue
or combination of cues that are responsible for invariant perception in natural
speech. A closer inspection of the data on which Cole and Scott's conclusion is
based, however, suggests that there is no real discrepancy. Their conclusion is
made possible only by their being selective in the results they consider, by a
loose interpretation of "invariant," and by an ambiguous use of the term "burst."

Briefly the evidence is this: if the burst (excluding the aspiration) from
a natural voiceless stop produced in syllable-initial position before one vowel
is spliced onto a different vowel containing a brief period of aspiration
without formant transitions (as formed in producing /h/), then a stop is heard
whose place of articulation will vary with the choice of vowels in a way that is
consistent with the results from synthetic speech (Liberman, Delattre, and
Cooper, 1952; Schatz, 1954). In particular, the same burst will be heard as /p/
before /i/ and /u/, but as /k/ before /a/. If instead of just the burst a longer
portion of the sound from a voiceless consonant is removed and translated onto a
different steady-state vowel, then again subjects' percepts do not change if the
vowels interchanged are /i/ and /u/, but they are also relatively little
affected if the vowel /a/ is included (Cole and Scott, 1974a). The discrepancy
here is readily explicable if we examine what information is being translated
from one vowel context to another. In Cole and Scott's experiment the durations
of sounds
translated were: /b/ 20 msec, /d/ 30 msec, /g/ 40 msec, /p/ 50 msec, /t/ 80
msec, and /k/ 100 msec. With the possible exception of /b/ and /d/, these
durations are sufficient to give considerable information about the following
vowel by virtue of formant transitions within the aspiration. Indeed, an
experiment
cited by Cole and Scott (1974a) in support of their claim demonstrates this
point. Winitz, Scheib, and Reeds (1972) removed all but the burst and
aspiration from natural tokens of /p,t,k/ spoken before the vowels /i,a,u/, and
played the resulting sounds to subjects for identification either in isolation
or followed by a 100-msec steady state of the same vowel before which they had
been spoken. The mean durations of the translated segments were 70, 77, and 93
msec for /p,t,k/, respectively. Subjects were asked to identify in both these
types of sound either the consonant or the vowel. The results showed that for
the sounds consisting only of the burst and aspiration the correct consonant was
identified 65 percent of the time and the correct vowel 64 percent. Adding the
steady-state vowel raised the scores to 71 and 86 percent, respectively. In
these data the consonant is no more identifiable than the vowel. Cole and
Scott's (1974a) procedure is sufficiently close to that of Winitz et al. to let
us presume that similar results would have been obtained with Cole and Scott's
sounds, had they asked their subjects to identify the vowel in the isolated
sounds. Although Cole and Scott claim that only the burst was translated, the
durations used in their sounds clearly allow the acknowledged possibility that
their sounds contained formant transitions. However, they claim (pace Winitz,
Scheib, and Reeds) that these transitions should not have aided perception,
citing the claim by Liberman, Cooper, Shankweiler, and Studdert-Kennedy (1967:
436) that formant transitions are not commutable between vowels. In fact, the
claim made by Liberman and colleagues is that in formant transition patterns
there is no commutable stop-consonant segment, since a slice from the beginning
of a formant pattern of a complete syllable will either be heard as a nonspeech
sound, or as a stop consonant followed by some vowel. This does not of course
imply that following formant transitions by an inappropriate steady-state vowel
prohibits the perceptual system from using the transition information to
interpret the previous burst, just as it does when there is no additional vowel
spliced on. Indeed, for some of the stimuli used by Cole and Scott (1974a) it
is likely that listeners heard two vowels.

The apparent discrepancy between Cole and Scott's experiment, on the one
hand, and those of Schatz (1954) and Liberman et al. (1967) on the other, is
thus attributable to the longer stimuli, excised by Cole and Scott, carrying
information about the following vowel. Since these stimuli probably contained
sufficient information for the following vowel to be identified, Cole and Scott
are no closer to demonstrating perceptual invariance of stop consonants with
vowel context than if they had removed none of the vowel.

Fricatives, Nasals, and Liquids 

The perception of fricatives and nasals is less controversial. In both
fricatives and nasals in syllable-initial position there is a long period of
steady state followed by transitions into the following vowel. The nature of
the steady state provides the main cue to the manner of articulation of the
consonant and also provides some information on the place of articulation. The
nasal murmurs produced at different places of articulation are quite similar
since the oral cavity acts only as a side chamber to the nasal cavities from
which the sound is radiated. The different nasals can be distinguished on the
basis of their nasal murmurs alone, but the bulk of the place information is
carried by the formant transitions (Liberman, Delattre, Cooper, and Gerstman,
1954; Malecot, 1956).

In contrast to nasals, fricative spectra (with the exception of /f,θ/ and
their voiced cognates /v,ð/) are markedly dissimilar (Strevens, 1960) and carry
the bulk of the perceptual load for place of articulation. Harris (1958)
segmented naturally spoken fricative-vowel syllables into a fricative portion
and a transition + vowel portion, which were then commuted. She found that
except for the /f,θ/ and /v,ð/ distinctions the place of articulation was
perceived according to the fricative spectrum rather than the formant
transitions.

Steady-state friction is the nearest that speech comes to a one-to-one
mapping between sound and phonetic category. Yet even here there is variation
with speaker and context, which, though not sufficient to cause confusion
between the fricative categories, can serve to distinguish the sex of the
speaker (Schwartz, 1968) and, through weak formant transitions within the noise,
to cue place of articulation of subsequent stops (Schwartz, 1967).

The liquids /r,l/ are characterized by having a brief (or in some contexts
nonexistent) steady state with a low first formant, followed by a rather slow
transition into the following vowel. The speed of the transition together with
changes in the second and third formant cue the presence of the liquid segment,
but /r/ seems to be distinguished from /l/ primarily by changes in the third
formant (Lisker, 1957b; O'Connor, Gerstman, Liberman, Delattre, and Cooper,
1957).

Voicing 

The dimension of voicing, which has received intensive study recently,
provides perhaps the best example of the intimate relationship between the
articulatory-acoustic constraints that shape the stimulus and the mechanisms
used to perceive the phonemic category. In final consonants voicing can be cued
by the duration of the preceding vowel (Denes, 1955; Raphael, 1972) and in
poststressed intervocalic stops by the duration of the stop closure (Lisker,
1957a); however, the perception of stops in utterance-initial prestressed
position has received the most attention.



Lisker and Abramson (1964) exploited the concept of voice onset time (VOT)
to describe the differences in speech production between the various categories
of voicing in stop consonants. This dimension classifies a particular stop
utterance according to the time difference between the vocal cords starting to
vibrate and the stop closure being released. For English voiced stops [b,d,g]
in utterance-initial position, this time is usually either around zero or
negative (the vocal folds starting to vibrate before the stop is released),
while for the English voiceless aspirated stops [pʰ,tʰ,kʰ] (as in pot, tot,
cot) there is a lag of between 20 and 100 msec from the release of the stop to
the onset of voicing. The value of this dimension is twofold. First, it
adequately describes the categorization of differently voiced stops from many
languages, different stops from similar context falling into clusters along the
VOT continuum, which are nonoverlapping for stops at the beginning of isolated
words and show only slight overlap for words spoken in sentences (Lisker and
Abramson, 1967). Second, it explains the changes in the acoustic cues
accompanying different stop-consonant voicings. A voiceless stop will normally
have a more intense burst, a longer period of aspiration, and a weaker first
formant during the aspiration than the voiced homolog. All these changes can be
adequately explained by the effect of changing the time for which the vocal
folds are held apart, inhibiting voicing. Abduction of the vocal folds allows a
greater pressure to build up within the oral cavity, leading to a stronger
burst on release. In addition, it provides only noise excitation, which is
weaker in low-frequency energy than voicing, and it acoustically couples the
trachea to the oral cavities; these last two factors are responsible for
reducing the intensity of the first formant. Spectrograms of real speech
illustrating three different values of VOT are shown in Figure 2(a).
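
The definition of VOT given above can be written out as a short computation.
The following Python sketch is only an illustration: the function names and
the 20-msec cut-off are assumptions (the cut-off is simply the lower end of the
20 to 100 msec lag range quoted above, not a measured category boundary), and
the example timings are hypothetical.

```python
def voice_onset_time(release_ms, voicing_onset_ms):
    """VOT in msec: onset of voicing relative to release of the stop closure.

    Negative values mean the vocal folds start vibrating before the release.
    """
    return voicing_onset_ms - release_ms


def classify_english_initial_stop(vot_ms, boundary_ms=20.0):
    """Label an utterance-initial English stop from its VOT.

    boundary_ms is an assumed cut-off, not a value reported in the text.
    """
    return "voiceless aspirated" if vot_ms >= boundary_ms else "voiced"


# Hypothetical (release time, voicing onset time) pairs, in msec.
tokens = {"prevoiced": (100.0, 20.0), "short lag": (100.0, 105.0), "long lag": (100.0, 160.0)}
labels = {
    name: classify_english_initial_stop(voice_onset_time(release, onset))
    for name, (release, onset) in tokens.items()
}
```

Under these assumptions the prevoiced and short-lag tokens fall in the voiced
category and the long-lag token in the voiceless aspirated category.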

Perceptually, this complex of cues resulting from a single articulatory
dimension raises the question of which cues are used and how they are combined.
Lisker and Abramson (1970) have shown that synthetic syllables differing in VOT
can be perceived as differing in voicing. But the syllables they used varied a
number of the concomitant acoustic cues, including both the actual time of
onset of voiced excitation and the intensity of the first-formant transition.
It can be shown that both these acoustic correlates of voice onset time are
perceptually important. It was clear from the early synthetic studies of
voicing that to produce a good token of /p,t,k/ there had to be both aspiration
present during the initial period of formant transitions and a reduction (or
cutback) in the amplitude of the first-formant transition (Liberman, Delattre,
and Cooper, 1958). The importance of the first-formant transition has again
been stressed in recent work by Stevens and Klatt (1974) and by Summerfield and
Haggard (1974). At a given VOT, perceived voicing can be influenced by a
first-formant transition after the onset of voicing or by the amount of energy
in the aspirated portion of the first formant. Interesting questions are raised
by the known variation in VOT with such contextual factors as rate of
articulation (Lisker and Abramson, 1967; Summerfield, 1974) and the nature of
the following segment (Lisker, 1961; Klatt, 1973). Although the use of the
first-formant transition as a cue may allow the increase in VOT brought about
by the presence of a liquid after the stop to be compensated for directly
(Darwin and Brady, 1975), it is also likely that other context effects demand a
change in the weightings attached to the various cues (Summerfield and Haggard,
1974).
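
One way to picture "weightings attached to the various cues" is as a weighted
sum of cue values passed through a decision function. The Python sketch below
is purely illustrative and is not a model proposed by any of the studies cited:
the logistic form, the weights, the 25-msec reference point, the 0-to-1 scale
for the first-formant transition, and the assumption that a first-formant
transition after voicing onset counts as evidence for a voiced stop are all
hypothetical choices.

```python
import math


def p_voiceless(vot_ms, f1_transition, w_vot=0.25, w_f1=2.0, vot_reference_ms=25.0):
    """Illustrative score (0 to 1) that a stop is heard as voiceless.

    vot_ms         voice onset time in msec
    f1_transition  assumed 0-to-1 extent of the first-formant transition
                   after voicing onset (treated here as evidence for "voiced")
    w_vot, w_f1    assumed cue weights; context could change them
    """
    evidence = w_vot * (vot_ms - vot_reference_ms) - w_f1 * f1_transition
    return 1.0 / (1.0 + math.exp(-evidence))


# Same VOT, different first-formant transition: the score changes.
without_transition = p_voiceless(25.0, 0.0)
with_transition = p_voiceless(25.0, 1.0)

# A context that down-weights the first-formant cue reduces its influence.
with_transition_downweighted = p_voiceless(25.0, 1.0, w_f1=0.5)
```

The point of the sketch is only that, at a fixed VOT, the outcome depends on a
second cue, and that changing a weight changes how much that cue matters.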


Syllable Boundaries 

A phonemic representation of speech needs to include some way of represent- 
ing the perceptible difference between such phrases as "I scream" and "ice- 
cream." This is done in both traditional and generative phonology by introduc- 
ing a juncture marker //// (/ai//skrim/ vs. /ais//krim/) . The phonetic changes 
accompanying a change in juncture have been studied by Lehiste (1960) and by
Gårding (1967), who show that some changes in juncture are perceptually easy to
distinguish, but they do not show directly which cues are actually used. One of
the most obvious allophonic changes that occurs with a change in juncture is
that of the aspiration of voiceless stop consonants. When Helen enters in
Act IV, scene 5, of Troilus and Cressida, she is greeted by a fanfare and the
shout "The Troyans' trumpet!" The aspiration of the word-initial /t/ and the
voicing of the word-final /s/ are the phonetic cues that potentially protect her
from insult. Using politically less sensitive material, Christie (1974) has
shown that the aspiration of voiceless stops is used as a cue to juncture, since
adding aspiration to a /t/ in the context /asta/ increases by about 30 percent
listeners' judgments that the syllable boundary occurs after, rather than
before, the /s/. Malmberg (1955) has also shown that the presence of formant
transitions either before or after closure can influence whether a consonant is
heard as coming after the first vowel or before the second vowel in a
disyllable such as /ipi/. Malmberg's results have been amplified by Darwin and
Brady (1975). They examine the cues underlying the distinction between "I made
rye" and "I may dry," and find that provided the stop closure is quite short,
appreciable formant transitions can occur after closure without shifting the
juncture to before the /d/. The perceptual system here takes into account
coarticulation effects across word boundaries that are also stop closures.
Although careful listeners can reliably distinguish consonant clusters with
different syllable boundaries, it is not clear how important this allophonic
variation is in running speech. We do not know, for example, how much less
intelligible running synthetic speech is that puts syllable boundaries in
inappropriate positions.

AUDITORY GROUPING AND FEATURE EXTRACTION

How do we group together sounds that are to be analyzed as a single source?
The extensive literature on auditory selective attention has identified a
number of variables that contribute to our remarkable ability to listen to one
voice among many. The best known are location and pitch.

Pitch

Fletcher (1929:196) was the first to show that the ear will fuse together
sounds from different parts of the spectrum when they originate from a common
source. Broadbent and Ladefoged (1957) amplified Fletcher's finding and showed
that fusion will occur when the sounds at the two ears have the same pitch; if,
on the other hand, the signals that two ears receive are amplitude modulated at
different rates, then two sources will be heard, one at each ear. The
importance of this phenomenon for speech is that it provides a mechanism
whereby the different formants from a particular speaker can be grouped
together as a separate perceptual channel from those belonging to other
speakers, who will in general be speaking at a different pitch (Broadbent and
Ladefoged, 1957). The breakdown of this mechanism can be noticed both in the
concert hall, when two different instruments play at the same pitch and the
resultant timbre sounds like neither (Broadbent and Ladefoged, 1957), and, more
esoterically, when, in dichotic listening experiments, two different synthetic
speech sounds with the same pitch are led one to each ear. Here the impression
is of a single sound, but curiously subjects tend to report the sound on the
right ear more accurately, indicating that the autonomy of the two sounds is to
some extent preserved (Shankweiler and Studdert-Kennedy, 1967; Darwin, 1969). A
demonstration of this autonomy-preserving fusion comes from Rand (1974), who
played to one ear of each of his subjects a stop-vowel syllable from which the
second-formant transition had been removed. This transition, which provided the
only cue to place of articulation (/b/, /d/, or /g/), was led to the opposite
ear but on the same pitch and in the correct temporal relation to the main
stimulus. He found that subjects could distinguish the place of articulation of
the consonant and hear an additional nonspeech noise in the ear receiving the
transition.

Grouping by a common pitch may be a special case of a more general
phenomenon of grouping together sounds by time of onset (with each laryngeal
pulse marking a new event), since the normal fusion of two sounds with a common
pitch can be overridden if the two components start at different times. For
example, if we first listen to formant 1 of a vowel and then add formant 2,
both formants can be heard in the resultant timbre. This is not true of the
vowel that can be heard by starting both formants simultaneously. The concert
hall again furnishes a further illustration. Turning on a radio in the middle
of a sustained orchestral chord yields a timbre that is not decomposable into
its component parts, but that separates out as soon as instruments change notes
at different pitches.

Pitch plays a further role in auditory grouping at a more complex level.
Bregman and Campbell (1971) have recently shown, although the principle was
well-known to baroque composers, that when a sequence of six notes, of which
three are high in pitch and three low, is played rapidly, the impression is of
two separate tunes, one high and one low. This impression only occurs for fast
presentation rates (the crucial rate depending on the pitch intervals employed)
and is objectively confirmed by subjects being unable to make judgments between
tunes of which note preceded which, although this ability is good within tunes
for notes the same time apart. This effect might derive from a speech
perception mechanism, enabling a listener to remain listening to a particular
voice over periods of silence or pitchlessness in competition with other voices
at different pitches. A recent unpublished experiment by myself and Davina
Simmonds supports this idea. Simmonds asked her subjects to shadow a passage of
prose presented to one ear with instructions to ignore what was presented to
the other ear. Unknown to the subjects, at some time during the passage the one
they were supposed to be shadowing might change to the opposite ear, giving a
semantic discontinuity on the shadowed ear. Treisman (1960) had previously
shown that in this situation subjects occasionally made errors in which they
continued to shadow the same passage after it had switched to the "wrong" ear.
Simmonds's contribution was to show that whether these errors occurred or not
depended on whether the intonation pattern switched between the ears. If the
passages were prepared by being read continuously so that intonation was
continuous across the semantic break, few intrusion errors occurred from the
opposite ear, although subjects tended to hesitate. The intrusions did occur,
though, when the intonation switched ears. Moreover, intrusions still happened
even when there was no semantic break. Continuity of intonation thus seems to
be an important factor in determining which part of the auditory input is to be
treated as belonging to the currently attended channel. It remains to be seen
how much this is due to short-term effects, such as those found by Bregman and
Campbell (1971), and how much it is due to predictions of the expected prosodic
pattern based on more complex aspects of the preceding input.

This extensive use of pitch in grouping auditory elements together suggests
that it might be extracted by a different mechanism from that which is concerned
with extracting information about the formant structure or timbre. This seems
to be the case. Many speech signals have little or no energy at the fundamental
(corresponding to the rate of vocal cord vibration) and yet the pitch is clearly
heard. This problem, the missing fundamental, has indicated the need for a
pitch mechanism other than the detection of place of excitation on the basilar
membrane. Licklider (1951) first suggested autocorrelation as a possible
mechanism for extracting this "periodicity pitch" and his idea has recently
been revived by Wightman (1973), who claims that autocorrelation can handle a
number of strange effects previously rather hard to explain (see Small, 1970).
Autocorrelation involves quite simply a comparison (correlation) of the signal
with itself delayed by varying amounts. The correlation will be maximum when
one signal is delayed relative to the other by an amount equal to the
periodicity of the waveform. Thinking of autocorrelation in this way leads to a
realization of a possible mechanism in terms of a neural delay line (Licklider,
1951), but another mechanism that is mathematically equivalent is based on the
observation that periodic signals have spectra with periodic peaks, the spacing
between the peaks being related to the periodicity. Thus a device for
recognizing regular patterns of excitation along the basilar membrane, rather
after the manner of spatial Fourier analyzers in the visual system (Campbell
and Robson, 1968), could also achieve autocorrelation. This latter type of
mechanism is favored by Wilson (1973) and seems to have the advantage of
achieving a first stage toward the required grouping of the individual
components, prior to analysis of the formant structure. Autocorrelation has
also been used as the basis of automatic pitch extraction devices that are less
prone to errors (such as jumping up an octave) than traditional pitch meters,
which operate on a principle closer to "place" theories (e.g., Lukatela, 1973).
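
The delay-and-compare idea described above can be sketched in a few lines of
Python. This is a minimal illustration, not a reconstruction of any of the
cited models or devices: the function name, the 8000-Hz sampling rate, the
50 to 400 Hz search range, and the test signal (harmonics 2, 3, and 4 of
100 Hz, with no component at 100 Hz itself) are all assumptions chosen to show
the missing-fundamental case.

```python
import math


def autocorrelation_pitch(signal, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate pitch (Hz) as the delay that maximizes the autocorrelation.

    The signal is multiplied by a delayed copy of itself and summed, for every
    delay (lag) whose corresponding frequency lies between f_min and f_max.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


SAMPLE_RATE = 8000  # assumed sampling rate in Hz

# 0.1 sec of a complex tone with components at 200, 300, and 400 Hz only.
missing_fundamental = [
    sum(math.sin(2 * math.pi * 100 * k * n / SAMPLE_RATE) for k in (2, 3, 4))
    for n in range(800)
]

estimate = autocorrelation_pitch(missing_fundamental, SAMPLE_RATE)
```

Although the test signal contains no energy at 100 Hz, the waveform repeats
every 10 msec, so the autocorrelation peaks at that delay and the estimate is
the 100-Hz periodicity pitch.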

Location 

It has been known for some time that angular separation of auditory sources
helps selective attention to one rather than the other (Broadbent, 1958),
provided that other cues such as pitch do not override the usefulness of
location. It appears also that localization is important in determining the
effect that one sound can exert on another preceding it. If a consonant-vowel
syllable is played to one ear followed (say 60 msec later) by another syllable,
differing in the consonant, to the other ear, then the second consonant will be
reported more accurately than the first (Studdert-Kennedy, Shankweiler, and
Schulman, 1970). The question now arises, if one sound masks its predecessor,
how do we ever perceive a continuous flow of speech? And, since the effect also
occurs in nonspeech tasks (Darwin, 1971b), how do we perceive any rapidly
changing sound? The answer perhaps lies in the observation that these masking
effects are less easily obtained when the two sounds used come from the same
location (Porter, 1971), when any masking that does occur tends to be more
symmetrical, with less predominance of backward over forward. Perhaps, then,
the auditory system is using location as a heuristic to decide whether two
sounds are to be treated as part of the same gestalt or whether they should be
distinguished and the processing of the first discarded in favor of the second.

So far there has been no indication that any of the mechanisms outlined
exist exclusively for speech, even though one might argue that they have arisen
because of the needs of speech perception. This is not altogether surprising,
but this distinction becomes more important when considering subsequent stages
of analysis.

Adaptation 

The discovery of single cortical cells selectively sensitive to simple
properties of a visual (Hubel and Wiesel, 1962) or an auditory (Evans and
Whitfield, 1964) stimulus lent physiological credence to theories of pattern
recognition that suggested that the organism first detects basic stimulus
properties of a perceptual object and subsequently uses this information as a
basis on which to construct a percept (Selfridge and Neisser, 1960). It also
renewed interest in perceptual aftereffects since some of these perceptual
distortions could be neatly explained by appealing to the adaptation of
detector units similar to those discovered electrophysiologically. The basic
methodological axiom here is that repeated exposure to a particular stimulus
weakens the subsequent response of detectors that have responded to that
stimulus. This weakening then causes a distortion in the subsequent perception
of any stimulus that would normally be capable of exciting those detectors. The
degree to which the perception of different test stimuli is affected by
previous exposure to some other stimulus thus gives an indication of what types
of properties are being detected by the sensory system (e.g., Blakemore and
Campbell, 1969).

This approach has been applied recently to auditory perception. Kay and
Matthews (1972) have found evidence in adaptation experiments for detectors
sensitive to a tone that is frequency modulated at a particular rate, thus
providing one auditory analog to the suggestion by Blakemore and Campbell (1969)
that the visual system contains detectors sensitive to luminance that is
amplitude modulated at particular spatial frequencies. Experiments on
adaptation to speech sounds have multiplied rapidly since a seminal experiment
by Eimas and Corbit (1973).

This study used stop consonants differing in voicing and with one of two
different places of articulation. For each place of articulation, they
constructed a continuum of sounds varying in VOT. Their subjects first
identified isolated sounds from these two continua, then they adapted to a
token of /b/, /p/, /d/, or /t/ taken from the appropriate end of the VOT
continuum by listening to it 150 times in two minutes. Their perception of the
two VOT continua was then retested in a series of trials, each of which
involved listening to a further 75 presentations of the adapting stimulus (in
one minute) followed immediately by a test stimulus. The results showed that
irrespective of the place of articulation of the test and adapting stimuli, the
voicing boundary moved toward the adapting stimulus, a slightly greater shift
being found for voiceless than for voiced adaptors. In a subsequent paper,
Eimas, Cooper, and Corbit (1973) showed that this shift in the voicing boundary
persists if the adapting and test stimuli are led to different ears. The
effect, then, is central rather than peripheral, but what type of detector is
responsible? Is it a linguistic feature detector specific to speech, or is it
an acoustic feature detector that can also subserve nonspeech distinctions?
Eimas and his colleagues tackled this question by using as an adapting stimulus
a voiced stop-vowel syllable with all but the initial 50 msec removed. This
stimulus preserves the information that voicing starts at the beginning of the
sound but does not sound like speech ("just a noise" in the words of the
subjects). As predicted by the linguistic feature detector notion, adaptation
to this sound gave no significant shift in the voicing boundary even when the
subjects were instructed to hear the sound as speech. However, Cooper (1975),
in a review of the adaptation work, cites an unpublished study of Ades who does
find some adaptation in this case.
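
The detector-fatigue account of such a boundary shift can be illustrated with
a toy calculation. The Python sketch below is an assumption-laden illustration,
not the model of Eimas and Corbit: it supposes two detectors with Gaussian
tuning along the VOT continuum (centred at 0 and 50 msec, width 20 msec, all
hypothetical values), places the category boundary where the two respond
equally, and represents adaptation as a reduced gain on the adapted detector.

```python
import math


def voicing_boundary(gain_voiced=1.0, gain_voiceless=1.0,
                     centre_voiced=0.0, centre_voiceless=50.0, width=20.0):
    """VOT (msec) at which two Gaussian-tuned detectors respond equally.

    Each detector responds as gain * exp(-(vot - centre)**2 / (2 * width**2)).
    Setting the two responses equal and solving for vot gives the expression
    below. All parameter values are assumptions made for illustration.
    """
    midpoint = (centre_voiced + centre_voiceless) / 2.0
    shift = (width ** 2 * math.log(gain_voiceless / gain_voiced)
             / (centre_voiced - centre_voiceless))
    return midpoint + shift


unadapted = voicing_boundary()
after_voiced_adaptor = voicing_boundary(gain_voiced=0.7)        # fatigue the voiced detector
after_voiceless_adaptor = voicing_boundary(gain_voiceless=0.7)  # fatigue the voiceless detector
```

With these assumed values the boundary sits at 25 msec before adaptation,
moves to a shorter VOT after a voiced adaptor, and to a longer VOT after a
voiceless adaptor; that is, in each case it moves toward the adapting stimulus,
in the direction reported above. The sketch does not attempt to reproduce the
slightly larger shift found for voiceless adaptors.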

The question of whether the adaptation effects obtained were due to the
adaptation of a complete linguistic feature or to adaptation of the particular
cues that can subserve it has been pursued by Bailey (1973). He used a
linguistic dimension with well-understood multiple cues. Place of articulation
for voiced stop consonants can be cued by the second- and third-formant
transitions. Bailey first showed that the adaptation effect could not be
occurring at the level of a detector responding to place of articulation per
se, since when subjects adapted to the syllable /be/ they showed more shift
along the /be/-/de/ continuum than along the /ba/-/da/ continuum (Figure 3 a,
b; see also Ades, 1974a). They did show some shift in the latter case, and
Bailey's second experiment suggested why. Here, rather than changing the
subsequent vowel, the particular formant transition used to cue the place
distinction was changed. One set of stimuli used formants 1, 2, and 4 (F4 had
no transition) with the place distinction being cued only by changes in the
second formant. The other set of stimuli used all four formants but with a
neutral F2 transition so that the place distinction was being cued by changes
in the third formant. Adaptation had a significant effect on the phoneme
boundary when both adaptor and test items were from the same set. However, when
they came from different sets there was only an effect when the adapting
stimuli were distinguished by a change in a formant that was present, if
constant, in the test stimulus. When the adapting stimulus varied in F3 but the
test stimulus had no F3, then no adaptation effect was observed (Figure 3 i).
Here there could be no processing of F3 transitions to reveal the underlying
distortion of such processing by the adapting sequence. In a subsequent
experiment Bailey showed that some effect of adaptation returned if the test
stimulus was given a flat third formant.

Figure 3

These results give clear evidence that adaptation can occur at the level of
detectors for specific acoustic features that precede the subsequent pooling of
these features for a decision about the linguistic dimension. This conclusion
has been confirmed by a number of recent studies. The boundary for place of
articulation for stop consonants can be shifted by presentation of the isolated
second and third formants (Figure 3 e, f, g; see also Tartter, 1975), by the
isolated formant transitions (Figure 3 h; Ades, 1973; Tartter, 1975), or by the
second- and third-formant transitions accompanied by an inappropriate,
nonspeechlike first-formant transition (Tartter, 1975). Provided, then, that
the adapting stimulus contains cues that distinguish the items on the test
continuum, the adapting stimuli themselves need not be heard as speech sounds,
as pointed out by Cooper (1975).

Adaptation of specific acoustic cues can also explain why adaptation fails
to generalize between initial and final consonants. Adapting to /bæ/ will shift
the /bæ-dæ/ boundary but will have no effect on the /æb-æd/ boundary, and vice
versa (Figure 3 j; Ades, 1974a). The reason for this is that for the synthetic
speech sounds used in this experiment the formant transitions that cued the
initial stop consonant were mirror images of those that cued the final stops
and so were presumably served by different detectors. This explanation has been
strengthened by an experiment by Tash (1974). Tash placed the formant
transitions appropriate for initial stops at the end of a steady vowellike
sound whose formant values were the same as the start of the transitions. This
gave nonspeech sounds with the transitions in final position (Figure 3 d).
These sounds were effective at shifting the boundary for initial stops. Thus,
provided the direction of the transitions is preserved, adaptation can occur
between initial and final positions.

Adaptation at the auditory feature level, though undoubtedly present, is
not the whole story. Stop consonants, differing only in place of articulation,
can be synthesized to have identical first-formant transitions so that they are
distinguished only by the second- and third-formant transitions. If adaptation
were occurring only at the auditory feature level, we might expect that the
presence of the (constant) first-formant transition would be irrelevant to the
adaptation effect of place of articulation, since it does not carry any
distinguishing information. However, it is clear that although the isolated
second and third formants or their isolated transitions do shift the
place-of-articulation boundary, this effect is much less than if the first
formant is included (Ades, 1973; Tartter, 1975).

To explain this dependency we need to consider levels above the auditory
feature. Let us assume that a number of auditory feature detectors map in a
hierarchical way onto some higher-level unit. For example, three rising formant
detectors might map onto a detector for labial place of articulation or onto a
particular syllable; at least one of these, the rising first-formant detector,
will presumably also map onto units at this higher level that have different
places of articulation.

After adapting to a complete speech sound with all three formants present,
all three of the auditory feature detectors will have been fatigued, but after
adapting to the sound without the first formant, only the second- and
third-formant detectors will have adapted. There are now two ways to explain
the greater boundary shift with the complete stimulus. We can assume that
through some nonlinearity in the system the adapted first formant leads to a
greater reduction in the firing of the higher-level unit already receiving a
reduced input from other (adapted) detectors than in the firing of a unit
receiving unadapted input. Alternatively, following Tartter (1975), we can
assume that some adaptation is occurring in the higher-level unit itself.
Although on the basis of the data presented so far the former hypothesis is
perhaps more economical, additional adaptation at a higher level is made more
likely by two types of study: split formants and cross modality. In the
split-formants experiment the first formant of a syllable is led to one ear
while the other two are led to the opposite ear. If the two sounds are played
simultaneously, then the entire syllable is heard, but if they are played one
after the other, then two nonspeech sounds are heard. In an ingenious
experiment, Ades (1974b) showed that there was greater adaptation when the
sounds were presented simultaneously than when they were played at different
times. Playing the components at different times should have no differential
effect on the adaptation to the detectors of those components, but it would
prevent adaptation to the higher-level category since that is never heard. The
greater adaptation for simultaneous presentation thus argues for adaptation
also occurring at some level above the auditory feature level. The second type
of evidence comes from cross-modal adaptation. Repeated visual presentation of
a syllable the subject reads silently gives a shift in the boundary for voicing
(Cooper, 1975), which is specific to position within the syllable (Eimas, in
preparation) so that a visually presented "bae" will adapt the auditory
/bæ-dæ/ boundary but not the /æb-æd/ boundary.
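
The first of the two explanations above, a nonlinearity at the higher-level
unit, can be made concrete with a small numerical sketch in Python. Everything
in it is an assumption introduced for illustration: the saturating response
function, the detector outputs of 1.0 (fresh) and 0.6 (fatigued), and the
simple summing of three detector inputs. It is not taken from any of the
studies cited.

```python
def unit_response(total_input, half_saturation=1.0):
    """Assumed saturating (compressive) response of a higher-level unit."""
    return total_input / (total_input + half_saturation)


FRESH, ADAPTED = 1.0, 0.6  # assumed detector outputs before and after fatigue


def drop_from_adapting_f1(other_inputs):
    """Fall in unit response when only the first-formant detector is fatigued.

    other_inputs holds the outputs of the second- and third-formant detectors
    feeding the same unit.
    """
    before = unit_response(FRESH + sum(other_inputs))
    after = unit_response(ADAPTED + sum(other_inputs))
    return before - after


# Unit whose other inputs are already adapted versus one whose inputs are fresh.
drop_with_adapted_partners = drop_from_adapting_f1([ADAPTED, ADAPTED])
drop_with_fresh_partners = drop_from_adapting_f1([FRESH, FRESH])
```

Because the assumed response curve is steeper at low input levels, the same
loss of first-formant input produces a larger fall in the unit whose other
inputs are already reduced, which is the asymmetry the nonlinearity hypothesis
requires. A linear unit would show no such difference.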

What then might be the level of this higher category? We can put forward
three candidates: the phonetic feature, the phoneme, and the syllable. A number
of authors have proposed that adaptation effects can occur at the phonetic
feature level (Ades, 1974a; Cooper, 1974; Tartter, 1975), the argument being
that there is some generalization for adaptation to place of articulation across
vowels (Ades, 1974a; Cooper, 1974; Tartter, 1974) and across manner of
articulation (Cooper and Blumstein, 1974). But these arguments are not strong
since this generalization can perhaps be handled by adaptation at the auditory
level (see Cooper, 1975, for a discussion of these issues). Cooper and
Blumstein find significant shifts in the /bæ-dæ/ boundary after adaptation to
/bæ, mæ, væ/ (Figure 3 c) but very little after adaptation to /wæ/ (Figure
3 k). Since all these sounds except /wæ/ contain similar formant transitions,
whereas /wæ/ has slower transitions, the observed adaptation effects can be
explained in terms of rate-specific formant transition detectors. Rather more
difficult to explain are the cross-vowel adaptation effects, but these can
perhaps also be handled at the auditory level. Both Tartter (1975) and Ades
(1974a) have shown that adapting to a stop consonant in front of one vowel
gives some shift in the boundary for place of articulation of a consonant in
front of another vowel. Both of them used three-formant syllables with the
third formant common to both vowels. Since third-formant transitions are cues
to place of articulation, it is not surprising that some cross adaptation
occurred.

Indeed, in Tartter's experiments it is possible to compare directly the
extra adaptation effect attributable to the syllables sharing a common phoneme
while controlling for common acoustic cues. Tartter finds that the third formant
from a /bæ/ produces a significant shift in the /bæ-dæ/ and the /dæ-gæ/
boundaries. This shift is in opposite directions for the two boundaries, since
/bæ/ and /gæ/ both have rising third formants, while /dæ/ has a falling third
formant. However, Tartter also examines the shift in these boundaries after
adaptation to /bi/ and /gi/. These stimuli are of interest since they both have
the same third formant as /bæ/, but different first and second formants. On
purely auditory grounds, then, we would expect /bi/ to have the same effect on
the /bæ-dæ/ boundary as the /bæ/ third formant, while if some adaptation were
occurring at the feature or phoneme level, /bi/ should have a larger effect on
the /bæ-dæ/ boundary than the isolated third formant. A similar prediction can
be made mutatis mutandis for /gi/. In fact, the average adaptation effect when
the phoneme is shared is only 20 percent greater than when auditory factors
alone are contributing to the adaptation.

If adaptation is occurring at some level beyond the auditory feature, then
it would appear that the most appropriate level, at least for stops cued by
formant transitions, is the syllable. This would explain why an entire speech
sound is a more effective adaptor than the acoustic discriminanda alone and also
why cross-modal adaptation is specific to initial or final position. Assuming
that some adaptation occurs at both an auditory and a higher level also allows
one to explain why the /bae-dae/ boundary does not shift after adaptation to
/gae/ despite significant shifts being found after adaptation to the third
formant that /bae/ and /gae/ have in common (Cooper, 1974; Tartter, 1975).

There are a number of lines of evidence to suggest that auditory feature
detectors are specific to a sound's spatial location. Ades (1974b) found that
he could adapt the two ears simultaneously to different sounds using dichotic
presentation. Thus, one ear received /bae/ at the same time as the other ear
heard /dae/. The direction of the shift in the /bae-dae/ boundary was different
for the two ears. Ades also showed that after adapting to a sound presented to
one ear alone, there was incomplete (55 percent) transfer to the opposite ear.
Although Ades interpreted these results in terms of peripheral versus central
adaptation, it would be equally valid to interpret them in terms of
location-specific auditory detectors. Recent support for this notion comes from
an experiment on the verbal transformation effect (Warren, 1968). Repeated
listening to a word causes it to lose its meaning and change its sound so that
it is perceptually transformed into another word. It seems likely that at least
part of this effect is due to adaptation at the acoustic level (see Lackner and
Goldstein, 1974). Warren and Ackroff (1974) have shown recently that if the
same word is presented to the two ears but slightly offset in time so that two
distinct utterances are heard, then different transformations can be heard in
the two ears. This suggests again that there are different sets of detectors
for different spatial locations capable of being differentially adapted.

Such wanton proliferation of auditory detectors might seem unwarranted and
unnecessary, but without multiple detectors for identical sounds it is difficult
to see how two separate streams of speech could be handled at the same time.
That this does appear to be the case is shown by studies of selective attention.
If a subject has to shadow a prose passage read to one ear while at the same
time trying to make a manual response whenever a target word is played into
either ear, he will fail to detect the vast majority of targets on the
unattended ear, while successfully responding to those on the shadowed ear
(Treisman and Riley, 1969). However, if the subject is conditioned before the
experiment by being subjected to an electric shock each time he hears a word
belonging to a particular semantic category and subsequently has to respond
manually to words belonging to that semantic category while shadowing, then
although virtually none of the words on the unattended ear produce a manual
response, over a third give a galvanic skin response (Corteen and Wood, 1972).
Thus, although the semantic properties of words on an unattended channel rarely
reach consciousness, it can be shown that their semantic properties have been
extracted. Some basic perceptual processing can occur for different speech
streams at the same time.

In summary, then, the work on adaptation gives good evidence for detectors
tuned to complex auditory patterns, such as particular formant transitions, that
may exist in multiple sets, each set being maximally responsive to sounds from a
particular location. The adaptation work gives evidence also for units at a
more complex level than the auditory feature. While it is not yet clear what
level these additional units are at, it is suggested that the available evidence
is not incompatible with formant transition information being mapped directly
onto a syllabic unit. For more discussion of these points and also a review of
the adaptation work on voicing, see Cooper (1975).

As a cautionary footnote to the work on adaptation, it is possible that
some of the phoneme boundary shifts found in adaptation experiments may be due
to factors other than the adaptation of various detectors. In particular, it is
possible that some of the observed effects can better be looked on as criterion
shifts brought about by a change in the range of stimuli recently presented to
the subject. Brady and Darwin (in preparation) have found that when subjects
have to classify stop consonants presented in blocks of 16 trials during which
all the stimuli come from a subrange of the voicing continuum, the phoneme
boundary moves as a function of the location of the subrange used in that block
(Figure 4) and the subrange used in the preceding block. This is true whether
the subjects have heard sounds from the entire range at the beginning of the
experiment or not (although the effects are larger when they have not). The
direction of the shift is such that sounds near the boundary are heard as being
more voiceless when they occur in a range that extends toward the voiced end.
However, it is unlikely that these range effects can themselves be explained by
adaptation, since Sawusch, Pisoni, and Cutting (1974) have failed to find a
similar shift in the voicing boundary when they varied the probability
distribution of the stimuli presented, rather than their range. These authors
used two different probability distributions, one in which the most voiced
stimulus was four times as likely to occur as any of the other stimuli to be
identified and another in which the most voiceless stimulus was the more
probable. Their experimental design differs from our range experiment in that
it always includes some tokens from each end of the range. Since the number of
these tokens is very small, it seems unlikely that they could be causing much of
a change in the adaptation state of property detectors; rather, we should
perhaps conclude that they are sufficient to cause a shift in some criterion
setting used to evaluate the phonetic significance of the output of the
property detectors.
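
The criterion-shift reading of the range effect can be made concrete with a
small sketch. It is an illustration under stated assumptions, not the analysis
used by Brady and Darwin: the voice-onset-time values, the long-term boundary
of 30 msec, the weighting of 0.3, and the function names are all hypothetical,
and the influence of the preceding block is left out. The listener's criterion
is taken to lie between a fixed long-term boundary and the midpoint of the
subrange heard in the current block.

```python
def criterion(block_vots, long_term_boundary=30.0, weight=0.3):
    """Voicing criterion (msec VOT) pulled toward the midpoint of the block.

    All parameter values are hypothetical and chosen only for illustration.
    """
    midpoint = (min(block_vots) + max(block_vots)) / 2.0
    return (1.0 - weight) * long_term_boundary + weight * midpoint


def label(vot, block_vots):
    """Label a stimulus by comparing its VOT with the block's criterion."""
    return "voiceless" if vot > criterion(block_vots) else "voiced"


# Two hypothetical subranges of a voicing continuum (VOT in msec).
voiced_end_block = [0, 10, 20, 30]
voiceless_end_block = [30, 40, 50, 60]

# The same physical stimulus, near the long-term boundary, in each block.
probe = 30
print(label(probe, voiced_end_block))
print(label(probe, voiceless_end_block))
```

With these assumed values the probe falls above the criterion in the block that
extends toward the voiced end and below it in the other block, which is the
direction of shift described above. No change in detector sensitivity is
involved; only the decision criterion moves.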

Dichotic Masking

Although less well-developed than the work on adaptation, some experiments
on dichotic masking are compatible with the idea that the extraction of acoustic
features can be interrupted by the subsequent presentation of another sound that
shares similar features.

The initial impetus to this work came from the previously mentioned
experiment by Studdert-Kennedy, Shankweiler, and Schulman (1970), which showed
that when two different stop-consonant-vowel syllables were presented one to
each ear with a temporal offset between them, the second syllable tended to be
reported more accurately than the first, over a range of offsets from 15 to 120
msec. Although initial interpretation of this effect was in terms of the
interruption of some special speech processing device, on the grounds that vowel
sounds were less prone to this masking (Porter, Shankweiler, and Liberman,
1969), it now seems likely that the effect arises at the acoustic feature
extraction stage after the initial grouping process. The reasons for this are
first that the vowel/consonant dichotomy seems irrelevant, since Allen and
Haggard (1973) have shown, confirming a prediction by Darwin and Baddeley
(1974), that acoustically similar vowels suffer greater backward than forward
masking, whereas acoustically different vowels do not. Second, greater backward
than forward masking occurs for sounds distinguished by rapid transitions,
irrespective of whether these transitions cue a linguistic distinction or not
(Darwin, 1971b). Third, Porter and others have found that the stop consonant in
a CV syllable can be masked by subsequent contralateral presentation of the
syllable's second formant; again, this shows greater backward than forward
masking.

Reasons for supposing that these masking effects are due to the
interruption of the extraction of some acoustic feature rest on analogies with
visual masking. Here it is possible to distinguish between two types of masking:
integration and interruption. Integration masking, which is at a prior stage to
interruption masking, is more evident if target and mask are presented to the
same eye; it then depends mainly on the relative energies of target and mask and
to a much smaller extent on the contour relationship between the two. On the
other hand, interruption masking occurs equally whether target and mask are
presented to the same or to different eyes, and appears to be independent of the
relative energies of the target and mask, but does require that the two share
similar contours (Turvey, 1973). Darwin (1971b) pursued the analogy between
auditory and visual masking by distinguishing between the sets of sounds used
for the target and the mask. Previous experiments had confounded the two by
drawing both target and mask from the same set of sounds, and then asking the
subject to report either both sounds or the one on a particular ear (Kirstein,
1970). Darwin used the syllables /be, pe, de, te/ as a target set and one of
four masks (/ge, e, o/ and nonspeech steady-state timbre) presented on the
opposite ear either 60 msec in front of or behind the target. He found that for
place of articulation the amount of forward masking was the same for all the
masks and rather small. The amount of backward masking, on the other hand, did
depend on which mask was used, being very much greater for the /ge/ mask than
for the others, which showed only minimally greater backward than forward
masking. The perception of voicing showed only slightly greater backward than
forward masking and this was the same for all the masks used. As in vision,
then, there is greater backward than forward masking only when the mask shares
some features with the target.

The question of what type of features the target and mask must share has
been taken up by Pisoni and McNabb (1974; Pisoni, 1975). Their experiment uses
as a target set /ba, da, pa, ta/, and six different masks /ga, ka, gae, kae, ge,
ke/. They too find that the amount of backward masking depends on the
similarity between the target and the mask: backward masking is much greater
when the mask has the same vowel as the targets.

This result is compatible with the idea that backward masking is sensitive
to the particular auditory features present in the consonant rather than to its
phonetic features, since only the former vary when the vowel is changed.
However, in a subsequent experiment they show that similar results also apply in
forward masking, when the mask precedes the target. This result causes them to
interpret both their experiments in terms of an integration hypothesis, since it
is crucial for the interruption hypothesis that backward masking should be
greater than forward masking, but it is possible to accommodate the effect of
target-mask similarity into an integration theory. This interpretation is
appropriate for Pisoni and McNabb's (1974) results (although their data do show
a slightly greater effect of backward masking), but it cannot be used, as Pisoni
(1975) claims, as an interpretation of those results that have shown appreciably
greater backward than forward masking.

How can we reconcile these results? It is clear from the experiments that
do show greater backward than forward masking that some forward masking is
occurring (Studdert-Kennedy et al., 1970), so it may not be inappropriate to
suggest that some integration masking is present. Indeed this would not be
surprising, since the sounds used in all these experiments have a common pitch
and so are likely to fuse together to some extent, depending on their onset
asynchronies. The important question is why some experiments show much more
backward than forward masking (Kirstein, 1970; Darwin, 1971b; Porter, 1971;
Studdert-Kennedy et al., 1970) whereas others do not (Pisoni and McNabb, 1974).
The answer probably lies in the different tasks required of the subject. Pisoni
and McNabb's experiment is unique in that the subject always knew onto which ear
the target would come and whether it would precede or follow the mask, whereas
in all the experiments that have shown greater backward than forward masking,
the subject was in doubt either as to the ear on which the target sound would
arrive or as to whether it would be the first or the second. The presence of
both these physical cues may then allow the subject to stop the second sound
from interrupting the processing of the first.

In summary, the data on dichotic masking of speech sounds suggest that both
integration and interruption processes are occurring and that in both types the
similarity of the target and the mask can affect the amount of masking obtained.
However, in circumstances where interruption masking is clearly occurring
(backward masking greater than forward) the evidence suggests that it is
occurring at the level of auditory features. Little more can be inferred at
present from these experiments about the nature of these features. It is also
unlikely that the masking experiments will provide as direct an access to them
as adaptation experiments have, both because of the different types of masking
that may occur and because of the contribution that other factors such as echoic
memory may play in performance on masking experiments (Allen and Haggard, 1973;
Darwin and Baddeley, 1974). However, the ease with which backward masking
effects are shown for rapid acoustic transitions, compared with the unreliable
and small effects for vowels and pure tones (Massaro, 1970; Pisoni, 1972; Cudahy
and Leshowitz, 1974), suggests that the auditory features are likely to be
complex or time-varying.

Objective methods, then, are beginning to provide a way of defining
psychologically important auditory features. An interesting question is whether
these will turn out to be the same as those that seem to be important for other
reasons. Fant (1964) stresses the importance of looking in the speech signal for
"auditory patterns" that might serve either as a direct indication of a
segment's identity (as in /s/) or, more usually, would provide the raw materials
for the more complex processes required to interpret them in terms of linguistic
categories.

MODELS OF SPEECH PERCEPTION 

We are now in a position to see the magnitude of the problem posed by speech
perception, even at the phonemic level. Because of effects such as
coarticulation, the speech signal resists any effort to segment it into
acoustically defined portions that are influenced only by a particular phone,
except in a very restricted set of cases. Some segmentation is possible
according to purely acoustic criteria (Fant, 1967) and we have seen that there
is growing evidence that auditory features are extracted as part of the
perceptual process. But where do we go from here? What type of process mediates
between the auditory feature and the phonetic category?

Formant transitions do not provide a simple invariant cue of the form "a
slightly rising transition lasting 40 ms" (Cooper et al., 1952), but can we say
that they do not provide a set of invariant exclusive disjunctions of the form
"a falling second formant lasting 40 msec ending around 800 Hz OR a slightly
rising second formant lasting 40 msec ending around 2700 Hz, etc."? Rand (1971)
has provided a simple demonstration that this type of invariance is not
sufficient. He constructed two sets of synthetic syllables whose vowels had the
same second formant but different first formants; in one case the vowel sounded
like an /ae/ produced by a child, and in the other, an /e/ produced by an adult.
Each of these vowels was preceded by formant transitions to give the three stops
/b, d, g/. Rand found that the best transition for a /d/ response varied as a
function of the apparent vocal-tract size. The significance of a particular
second-formant transition thus depends on the interpretation of the formant
pattern as a whole, rather than on the value of the formant that it leads to.

The main dichotomy in models of speech perception has been between active
(analysis-by-synthesis) and passive models. The distinction between these two
types of model is that while the passive model sees the primary categorization
process as being due to some filter network, whose decision criteria change
relatively slowly with time (Morton and Broadbent, 1967), the active model sees
it as being the result of matching the input signal to an internally generated
representation that can change rapidly relative to that signal (Liberman et al.,
1962; Stevens and House, 1972).

Active Models 

One of the motivations behind active models of speech perception (e.g.,
Liberman et al., 1962; Stevens and House, 1972) is that of economy. Given that
the processes of speech production and perception are both highly complex and
formally similar, would it not be an economical solution to combine the two?
Economy of description, though a fundamental criterion for the linguist, is not
an infallible guide for the psychologist, and indeed there is no shortage of
evidence for the independent operation of perceptual and production mechanisms.
Listening to speech while speaking oneself is commonplace but is not readily
explained away by active models. The more extreme example of simultaneous
translation from one language to another (Neisser, 1967:217) illustrates this
point well; here the very languages of perception and production are different.
Other examples of independence appear from the clinic. Patients with the two
hemispheres separated by cutting the corpus callosum can show by activities of
the left hand that they comprehend instructions presented to the right
hemisphere, but they cannot show this by speaking (Gazzaniga and Sperry, 1967).
Congenitally anarthric individuals appear to have normal speech perceptual
abilities (Lenneberg, 1962), and children who by virtue of an articulatory
defect are unable to make a particular articulatory distinction show no
corresponding perceptual impairment (Haggard, Corrigall, and Legg, 1971).
Experiments such as these suggest that the ability to perceive speech comes
through "the distinctiveness of the speech wave which we have acquired by being
exposed to language in the first place and by reference to our own speech only
in the second place" (Fant, 1967:113).



Combining the mechanisms of production and perception also offers no way of
accounting for variables that change between speakers. That interspeaker
variation is a significant perceptual problem is indicated both by experiments
on speaker normalization (Ladefoged and Broadbent, 1957) and by the impaired
performance of automatic speech recognition devices when tested on more than one
speaker (see Hyde, 1972, for a review).

Although the problem of the lack of acoustic/phonetic invariance is cited
as one reason why an active model is needed, it is not clear that such a model
will solve the problem. The trouble lies in deciding what is to be compared
with what. The acoustic signal is presumably represented in terms of auditory
parameters, while the internally generated articulatory representation is in
terms of articulatory parameters. Before any direct comparison can be made
there must be some translation between the two. To get round this problem,
Stevens and House (1972) propose that a "catalog of relations between acoustic
and articulatory instructions [of approximately syllable length] is built up in
a child at an early age... As the child produces new and different articulatory
movements, new items are added to the catalog" (p. 53). But if the mature
speaker has this catalog, why bother with analysis-by-synthesis at all? Could
you not simply look up the acoustic pattern in the catalog?
Analysis-by-synthesis itself gets us no closer to solving the invariance
problem.
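
The force of this objection can be seen in a toy sketch. It is only an
illustration of the argument, not a rendering of the Stevens and House model:
the catalog entries, the two-number acoustic patterns (second-formant onset and
steady state in Hz), and all function names are hypothetical. Once the catalog
exists, the "synthesis" step reduces to retrieving a stored pattern, and the
analysis-by-synthesis loop returns exactly what a direct nearest-match lookup
returns.

```python
import math

# Hypothetical catalog: syllable -> stored acoustic pattern.
# The numbers are invented for illustration only.
CATALOG = {
    "ba": (900.0, 1200.0),
    "da": (1700.0, 1200.0),
    "ga": (2300.0, 1200.0),
}


def synthesize(syllable):
    """'Synthesis' from articulatory instructions; with a catalog, a retrieval."""
    return CATALOG[syllable]


def analysis_by_synthesis(signal):
    """Generate each candidate in turn and keep the one closest to the input."""
    best, best_error = None, float("inf")
    for candidate in CATALOG:
        error = math.dist(synthesize(candidate), signal)
        if error < best_error:
            best, best_error = candidate, error
    return best


def direct_lookup(signal):
    """Find the catalog entry whose stored pattern is closest to the input."""
    return min(CATALOG, key=lambda s: math.dist(CATALOG[s], signal))


signal = (1650.0, 1180.0)  # hypothetical input pattern
print(analysis_by_synthesis(signal), direct_lookup(signal))
```

Both routes depend on the same stored acoustic patterns, so neither says
anything about how those patterns cope with the variability of the signal.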

Analysis-by-synthesis is also seen as a way of using the constraints of the 
language to aid perception. This is dealt with later, but, in anticipation, ex- 
perimental evidence indicates that context is not used in the way suggested by 
active models, at least at the word level. 

Both the Haskins model (Liberman et al., 1967) and the Stevens and House
(1972) model emphasize that different perceptual mechanisms are needed for
phonemes that do or do not have invariant acoustic cues. Stevens and House
perform a preliminary analysis on the signal in order to guide the subsequent
synthesis, while the Haskins model allows only the more variable or "encoded"
sounds to engage the speech processing mechanism. The perceptual significance of
this dimension of "encodedness" has been claimed from a range of different
experimental paradigms. These experiments have shown that stop consonants (the
least invariant phonemes) produce different results from vowels (the most
invariant phonemes), while other phonetic categories fall somewhere in between.

Perception of Different Phonetic Categories 

The earliest demonstration of this difference between phonetic classes was
a set of experiments comparing categorization and discrimination of synthetic
speech sounds. Since the second-formant transition can act as a sufficient cue
for initial stop consonants, a continuum of sounds can be constructed varying in
the extent and direction of this transition. Subjects will then label with some
consistency sounds taken from this continuum as /b/, /d/, or /g/, depending on
their position along it. If pairs of sounds adjacent on this continuum are then
played to subjects, their ability to discriminate between the members of a pair
will be rather poor unless the pair happens to straddle the boundary between
sounds labeled as different phonemes. For a continuum consisting of vowels, on
the other hand, discrimination is good throughout the continuum (Liberman,
Harris, Hoffman, and Griffith, 1957; Fry, Abramson, Eimas, and Liberman, 1962).
Other paradigms that have shown differences between stops and vowels include
laterality and dichotic masking experiments. After simultaneous dichotic
presentation, stop consonants are recalled more accurately from the right than
from the left ear, whereas vowels show the effect less consistently (Shankweiler
and Studdert-Kennedy, 1967). Similarly, if two different sounds are led one to
each ear but with a temporal offset of around 60 msec, the second sound is
recalled more accurately than the first for stops, but not for vowels
(Studdert-Kennedy, Shankweiler, and Schulman, 1970). Other classes of speech
sounds have given results in the laterality paradigm intermediate between stops
and vowels. For example, place of articulation for fricatives is cued mainly by
the spectrum of the friction, but intelligibility is increased if appropriate
formant transitions are added (Harris, 1958). The friction is a comparatively
invariant cue to place of articulation, whereas formant transitions are more
variable. In keeping with the predictions for active theories, Darwin (1971a)
found that place of articulation for fricatives was only reported better from
the right ear if formant transitions were present. Similarly, Cutting (1972)
has shown that liquids (/r, l/), which can be regarded as having an intermediate
amount of invariance, show an ear difference between that for stops and vowels.

That these experimental differences should be attributed to the relative
amounts of invariance or encodedness of different phonetic classes has been
questioned. Fujisaki and Kawashima (1968) offered an alternative explanation of
the discrimination experiments. They observed that the discrimination of
short-duration vowels showed clearer peaks at the phoneme boundary than did
long-duration vowels. On the basis of this evidence, they proposed that
performance in a discrimination task is determined both by the categorization
process and by uncategorized information held in a buffer store. Pairs of sounds
that differ in terms of the categorization process can be judged different on
that basis, but if they are categorized as the same phoneme, then comparison is
made between their representations in the buffer store. Fujisaki and Kawashima
showed that short vowels were perceived more categorically than long vowels.
They suggested that a less accurate comparison could be made between the buffer
store representations of stop consonants and brief vowels than between those of
long vowels on account of their duration. This result has since been confirmed
by Pisoni (1971), who has also shown (Pisoni, 1973) that the accuracy of
discrimination between pairs of long- and short-duration vowels decreased as the
interstimulus interval increased, whereas the discrimination of pairs of stop
consonants remained stable with time. There was a marked difference between the
within-category discrimination scores for stop consonants and for vowels of the
same duration. Clearly, cue duration of itself is not an adequate explanation of
the discrepancy. One explanation of these effects (Liberman, Mattingly, and
Turvey, 1972; Pisoni, 1973) is that a special mechanism responsible for the
perception of stop consonants precludes the subsequent use of auditory
information for nonphonetic judgments. If this explanation is valid, then
Fujisaki and Kawashima's (1968) model needs to be altered to prevent auditory
information being used after it has been categorized in a particular way. This
explanation also renders the hypothesis of a special processing mechanism for
stop consonants immune from attack along the lines proposed by Fujisaki and
Kawashima.
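
The two-component account of discrimination proposed by Fujisaki and Kawashima
(1968) can be illustrated with a minimal numerical sketch. It is a
simplification under stated assumptions rather than their formulation: the
continuum units, the boundary at zero, the noise values, and the function names
are hypothetical, and labeling and trace comparison are treated as independent.
A pair is discriminated correctly if its members receive different labels;
otherwise the listener falls back on comparing noisy buffer-store traces, and
the trace noise is assumed to be larger for stops and brief vowels than for
long vowels.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_label_a(x, boundary=0.0, label_sd=0.5):
    """Probability that a stimulus at x is labeled as the category below the boundary."""
    return phi((boundary - x) / label_sd)


def p_correct(x1, x2, trace_sd, boundary=0.0, label_sd=0.5):
    """Predicted discrimination accuracy for a pair of stimuli."""
    pa1 = p_label_a(x1, boundary, label_sd)
    pa2 = p_label_a(x2, boundary, label_sd)
    # Route 1: the two sounds receive different phoneme labels.
    p_labels_differ = pa1 * (1.0 - pa2) + (1.0 - pa1) * pa2
    # Route 2: same label, so compare noisy traces held in the buffer store.
    p_trace = phi(abs(x2 - x1) / (math.sqrt(2.0) * trace_sd))
    return p_labels_differ + (1.0 - p_labels_differ) * p_trace


within = (-2.5, -1.5)  # both members on one side of the boundary
across = (-0.5, 0.5)   # pair straddling the boundary

for name, trace_sd in (("precise trace (long vowel)", 0.3),
                       ("noisy trace (stop or brief vowel)", 3.0)):
    print(name,
          round(p_correct(*within, trace_sd), 2),
          round(p_correct(*across, trace_sd), 2))
```

With a precise trace, accuracy is high everywhere on the continuum; with a
noisy trace, within-category accuracy falls toward chance and a peak appears at
the boundary. The sketch does not address the further finding that stops and
vowels of the same duration still differ.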

This question of the relationship between the categorization process and
the buffer memory trace from which it is derived has been examined in a
different context by Darwin and Baddeley (1974). As a result of experiments on
acoustic memory based on recency effects in immediate serial recall of lists of
items (Crowder and Morton, 1969; Crowder, 1971), they proposed that acoustic
memory is not influenced by the categorization process. Rather, it is simply an
analog representation of the acoustic stimulus, which becomes degraded with the
passing of time (Darwin, Turvey, and Crowder, 1972). The result of this
degradation is that acoustically fine distinctions are lost before acoustically
coarser ones. Darwin and Baddeley (1974) suggested that the various experiments
that had purported to show differences in categorization mechanisms for
different phonetic classes merely reflected differential contributions from
acoustic memory because of the different acoustic confusability of items within
different phonetic classes. Little useful acoustic information about place of
articulation for a stop consonant could be obtained from acoustic memory a short
time after its arrival, not because it has been categorized as a stop, but
because it is acoustically very similar to other stops with different places of
articulation. However, this information will be less useful in distinguishing
between different stop consonants than will similarly degraded information,
which need only be put into acoustically coarser categories.

According to this account, the reason some speech sounds show laterality
effects where others do not is because the vocabular
sufficiently acoustically different for useful info 
time in acoustic memory. Thus the left hemisphere i| 
gorize a left-ear signal, which, by virtue of poore 
hemisphere, is degraded compared with the right-ear signal (Darwin, 1973).
Similarly, the reason some sounds show more backward masking than others is
because for acoustically distinct vocabularies the categorization mechanism can
use the information in acoustic memory to take a second pass at a previously
interrupted categorization. This hypothesis predicts that there should be a
three-way correlation between laterality, dichotic masking, and acoustic memory
experiments, so that the greatest evidence of acoustic memory (in, for example,
recency experiments) is given by vocabularies of sound showing the least
laterality effects and the least dichotic backward masking effects. This
correlation has been shown in a number of experiments.

Acoustically similar vowels (such as /i, e, ae/) show little evidence of useful
acoustic memory in recency experiments (Darwin and Baddeley, 1974) whereas
acoustically distinct vowels (such as /i, a, u/) do. Similarly, acoustically
similar vowels show a significant right-ear advantage while acoustically
distinct vowels do not (Godfrey, 1974), and acoustically similar vowels show
more dichotic backward masking than do acoustically distinct ones (Allen and
Haggard, 1973). Syllable-final consonants show more recency than
syllable-initial consonants if they are acoustically distinct (/g,/,m/; Darwin
and Baddeley, 1974) but not if they are acoustically similar (/b,d,g/; Crowder,
1973), and likewise syllable-final consonants show more right-ear advantage than
syllable-initial if they are cued by slow transitions (/r,l/; Cutting, 1972) but
not if they are cued by fast ones (/b,d,g/; Darwin, 1969). In stop consonants,
voicing shows less backward masking than does place of articulation (Darwin,
1971b) but more recency. Adding appropriate formant transitions to fricatives
makes them show a right-ear advantage for place of articulation (Darwin, 1971a)
but gives no increase in the size of their recency effect (Crowder, 1973).

The success of this hypothesis rests not only on showing that the utility
of auditory information depends on the acoustic similarity of the items used
rather than on their phonetic class, but also on showing that under suitable
circumstances auditory information is available from sounds belonging to
acoustically similar categories, such as the stops. An experiment by Pisoni and
Tash (1974) gives direct evidence that some auditory precategorical information
is available from stop consonants. They used a same-different reaction-time
paradigm (Posner and Mitchell, 1967), measuring subjects' reaction times to
pairs of sounds drawn from a continuum between /b/ and /p/. Subjects had to
decide whether the two sounds were the same phoneme or not. Their reaction times
showed that it took them longer to decide that the two sounds were the same
phoneme when the sounds were physically slightly different (but still within the
same phoneme category) than when they were identical. They also found that the
time to decide that the two sounds were different was shorter when the sounds
differed by a larger distance on the continuum than by a small one, even though
the sounds always fell within different phoneme categories. Some precategorical
information must have been available to the subjects over the half-second or so
that separated the onsets of the two sounds.

It would be premature then to claim that we have any psychological evidence
for different phonetic classes of sounds being perceived by different types of
categorizing mechanisms. This does not mean of course that there are no
differences; it merely shows that the paradigms used so far are not sensitive
to what differences there may be. Deprived of this empirical support, active
models become less plausible. But in rejecting the active model as a mechanism
for categorization, we must be careful not to reject it as a statement of the
problem. The active model is correct in emphasizing that knowledge of the vocal
tract must be used in order to categorize speech, but it appears to be
incorrect in suggesting that the mechanism by which this is done is an active
analysis-by-synthesis. Failure to use knowledge of the mechanisms and acoustics
of the vocal tract is an important reason why contemporary attempts at machine
recognition of speech have been unsuccessful (see Hyde, 1972, for a review).
The success of programs for analyzing visual scenes is related to the
sophistication of the geometrical constraints they have employed (Guzman, 1968;
Clowes, 1971; Mackworth, 1973). Perhaps we can expect a similar improvement in
speech recognition with the use of more sophisticated knowledge about the vocal
tract. Although this knowledge exists in a variety of forms, including programs
to synthesize speech by rule (Holmes et al., 1964; Mattingly, 1968; Kuhn,
1973), it has not yet been applied systematically to perceptual problems.


Computational procedures exist that allow the cross-sectional area function
of the vocal tract to be estimated directly (without recursive procedures) from
the acoustic waveform (Atal, 1974). The advantages to the perceptual system of
performing such a transformation are clear in that many of the problems raised
by coarticulation evaporate; but is it likely that this is a first stage in
perception? For reasons outlined by Haggard (1971), such a process would be
more likely to appear in the perception of vowels, where the spectrum is
simpler, than for consonants with their additional sources of acoustic energy.
Haggard suggests that the acoustically complex consonants may be perceived by
heuristics that map directly from acoustic features to phonetic categories (as
we have seen suggested for voicing), while vowels may use a procedure that
computes something like the vocal-tract area function. Evidence from vowel
perception, however, indicates that if the perceptual process passes through an
articulatory representation of the vowel, this is perceived heuristically, or
by "rules of thumb" that do not achieve a representation as detailed as a
vocal-tract area function.

Carlson, Granstrom, and Fant (1970) report the results of experiments in
which subjects are asked to adjust the second formant of two-formant vowels to
match vowels composed of four formants. The value of this F2' lies between the
values of the matched vowel's second and third formants (except for [i:], where
it is above F3) in a position that Carlson, Fant, and Granstrom (1973) show is
predictable from the output of a model of the cochlea. The finite bandwidth of
this model causes the second-formant peak to be influenced by higher formants.
A possible articulatory correlate of F2', suggested by Kuhn (1975), is that it
may indicate the length of the mouth cavity (the cavity anterior to the point
of maximum tongue constriction), at least for the more constricted vowels. Kuhn
also suggests that emphasis on the mouth cavity as a perceptual variable may
help in speaker normalization, since Fant (1966) had deduced from acoustic data
that the difference in shape (as opposed to size) between male and female vocal
tracts lies more in the pharynx than the mouth cavity, which is more linearly
scaled between different-sized vocal tracts. The implication of all this is
that the perceptual system uses much cruder information to perceive a vowel
category than is needed to specify an area function completely.

There is another reason why a heuristic approach to vowel perception might
be advantageous. Stevens (1972) has shown that places of articulation for
consonants and vowels occur at points where a small perturbation in
articulation gives a minimum of perturbation in the acoustic output. If the
perceptual system were capable of deriving an exact area function from the
acoustic data, then this choice could be a disadvantage. However, if perception
operates heuristically, the advantages of a sloppy articulation producing a
relatively stable acoustic output, which is then mapped onto some idealized
articulation, become obvious. Studies of articulation under abnormal
conditions, such as might occur with a pipe held between the teeth (Lindblom,
1972), indicate that the articulation is drastically changed in order to attain
a more nearly constant acoustic result. In addition, X-ray studies of vowel
articulation (Ladefoged, DeClerk, Papçun, and Lindau, 1972) show that different
speakers use very different tongue positions to produce the same phonetic
vowel, again suggesting that some acoustic criterion must be satisfied. If,
then, perception is mediated by articulatory variables, this articulation is
unlikely to be equivalent to that of the speaker, as an application of Atal's
(1974) procedure would lead to, nor to that of the listener, as implied by
motor theories. Rather, we must assume some more abstract form that is neither
subject to their limitations nor capable of their variations.



Context and Prosodic Variables 

Although this paper has concentrated on the problems of speech perception
at the phonemic level, it would be misleading not to mention the further
complications introduced by considering the perception of speech over segments
longer than the isolated syllable or word.

Normal continuous speech varies from word to word in the precision with
which it is articulated, and there appears to be an intimate relationship
between the articulatory precision afforded a word and the ease with which that
word could have been predicted by the listener. A word isolated from a
predictable context is not as intelligible as the same word isolated from a
less-predictable context (Lieberman, 1963). The complement of this observation
is that listeners can use context to make up for poor articulation (Rubenstein
and Pollack, 1963; Lieberman, 1967).

The ability to use context as an aid to perception is one virtue of an
analysis-by-synthesis model of perception, but an impressive quantitative
account of the interaction between context and stimulus information has been
given within a passive framework by Morton (1970). In this model (see Figure
5), contextual constraints are seen as imposing a variable criterion on the
decision mechanisms underlying word recognition, so that expected words are
subject to a laxer criterion than are unexpected ones, and so they require less
sensory information to produce a percept. Morton (see also Morton and
Broadbent, 1967) contrasts this type of model with active models, which he
maintains would predict a change in sensitivity rather than a change in
criterion setting. This is presumably because the more expected word would be
put up as a candidate to the analysis-by-synthesis comparator earlier than the
less expected word, and so would find the stimulus trace in a less decayed or
overwritten form.
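
The contrast drawn here between a change in criterion and a change in
sensitivity can be made concrete with a small signal-detection sketch. The
equal-variance Gaussian assumption, the function names, and the numerical
values below are illustrative assumptions and are not part of Morton's (1970)
formulation. The sketch shows only that lowering the criterion for an expected
word raises correct reports and false reports together, while the sensitivity
estimated from those rates (d') stays the same.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def response_rates(d_prime, criterion):
    """Return (hit_rate, false_alarm_rate) for an equal-variance Gaussian observer.

    Assumption: evidence is N(d_prime, 1) when the word is present and
    N(0, 1) when it is absent; the word is reported when evidence > criterion.
    """
    hit_rate = 1.0 - STANDARD_NORMAL.cdf(criterion - d_prime)
    false_alarm_rate = 1.0 - STANDARD_NORMAL.cdf(criterion)
    return hit_rate, false_alarm_rate


def estimate_d_prime(hit_rate, false_alarm_rate):
    """Recover sensitivity from observed rates: z(hits) - z(false alarms)."""
    return (STANDARD_NORMAL.inv_cdf(hit_rate)
            - STANDARD_NORMAL.inv_cdf(false_alarm_rate))


if __name__ == "__main__":
    sensitivity = 1.0  # hypothetical sensory evidence, identical in both contexts
    # Hypothetical criteria: context lowers the criterion for the expected word.
    for label, criterion in (("unexpected word", 1.0), ("expected word", 0.5)):
        hits, false_alarms = response_rates(sensitivity, criterion)
        print(label, round(hits, 3), round(false_alarms, 3),
              round(estimate_d_prime(hits, false_alarms), 3))
```

On this reading, an active model of the kind Morton contrasts with his own
would show up as a change in the estimated d' itself rather than in the
criterion.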

But there is more to the perception of connected speech than the use of the
context it supplies, for if words spoken in isolation are concatenated, the
intelligibility of the resulting speech is very low (Stowe and Hampton, 1961).
As Huggins (1972b) has observed, this is particularly striking when one
considers that the intelligibility of the individual words that constitute this
speech is higher than that of the same words spoken fluently. Presumably, then,
perception of connected speech does not proceed solely as a sequence of serial
decisions of phonemic or word size, helped by phonological, syntactic, and
semantic constraints, but is augmented by suprasegmental factors such as
intonation and rhythm. Prosody obviously provides such information as where
stress falls in a sentence, whether a question is being asked, what the mood of
the speaker is, and so on (see review by Fry, 1970), but it perhaps also plays
a more dynamic role in perception. It may serve to direct the listener's
attention toward potentially informative parts of the speech stream (Cutler,
1975) and to segment the stream into chunks that are then candidates for
higher-level units of analysis.

Figure 5: Morton's "Logogen Model." While this model makes no attempt to
describe the acoustic-to-phonemic stage in speech perception, it provides a
useful summary of possible mechanisms before and after this stage. The
persistence of a brief, overwritable precategorical acoustical store (PAS) is
used to explain modality-specific recency effects in short-term serial recall,
and is invoked in this paper to explain perceptual differences between
different phonemic classes of speech sound. The logogen system provides an
interface between sensory information and knowledge of the language and the
world. The system consists of morpheme-sized units that can be biased by the
cognitive system on the basis of its expectancies. This aspect of the model
handles quantitatively changes in recognition accuracy of words with changing
frequency, context, and signal-to-noise levels (Figure 1, Morton, 1970;
reproduced with permission of publisher and author).

Experimental evidence on the importance of prosodic variables in perception
is scattered, but what is available suggests that they have been unjustly
neglected. Martin (1972) has discussed the theoretical implications of rhythm
in perception, and Huggins (1972b) has provided a brief but useful review of
the perceptual significance of prosody in connected speech. One of the points
to emerge from this review is that listeners put more trust in prosodic
information than they do in segmental information when the two are in conflict.
Wingfield and Klein (1971) examined this question by playing to subjects
sentences whose intonation indicated that a major syntactic boundary was
occurring at a point that was incompatible with the words in the sentence.
Subjects' transcriptions of these mismatched sentences occasionally included
word substitutions that changed the syntactic structure to be compatible with
the intonation. Prosody seems also to be trusted more at the word level, since
words read in a foreign accent tend to be heard as having the spoken, though
incorrect, stress pattern even if this sacrifices useful segmental cues
(Bansal, 1966; cited in Huggins, 1972b). This strategy is perhaps a wise one
when we consider the relative resistance to distortion of prosodic and
segmental information. Speech that is so severely band-limited that the overall
intelligibility is only 30 percent still carries enough prosodic information
for the stress pattern of the words to be correctly perceived (Kozhevnikov and
Chistovich, 1965). The same is true for hummed speech (Svennson, 1974), which
carries no segmental information. Using spectrally rotated speech, which again
gives no segmental information, Blesser (1969) has found that for some
sentences the syntactic structure of the transcribed sentence corresponds
closely with that of the original.

While these experiments indicate that prosody furnishes useful information
about stress patterns, and perhaps about syntactic structure, in the absence of
segmental information, there has been little attempt to model how the
interaction between syntactic information obtained lexically and that obtained
prosodically is achieved. However, there are two approaches to sentence
perception that might provide a suitable framework for approaching this
interaction.

Bever (1970) has described a heuristic approach to the perception of
sentences in which words are grouped together according to strategies based
mainly on the grammatical class of the words. These strategies suggest how the
major constituents of a sentence could be extracted, and provide a possible
theoretical link between syntactic processing mechanisms and the use of
prosodic information. If indeed prosody can help to determine syntactically
useful segments, then strategies using this information could readily be
incorporated into the type of scheme that Bever suggests. Experiments by
Scholes (1971a, 1971b) provide a start on delimiting the usefulness of prosodic
information for resolving syntactic ambiguity. In a similar vein, work done by
Lea (1973) on the use of prosodic information as an aid to automatic speech
recognition shows that the ends of major syntactic constituents (except
noun-phrase/verbal boundaries) can be detected quite reliably by looking for a
drop in the pitch contour.
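
The idea of locating constituent boundaries by looking for a drop in the pitch
contour can be sketched in a few lines. What follows is a minimal illustration
under stated assumptions and is not Lea's (1973) procedure: the function name,
the 10 percent threshold, and the sample contour are hypothetical, and the
input is assumed to be a list of positive fundamental-frequency values (in Hz)
taken from voiced frames only.

```python
def find_pitch_drops(f0, min_drop=0.10):
    """Return indices of valleys where F0 has fallen by at least `min_drop`
    (as a fraction) from the highest value seen since the last boundary.

    Assumptions (illustrative only): `f0` holds positive F0 values in Hz for
    voiced frames, and a candidate boundary is a local minimum of the contour.
    """
    boundaries = []
    if len(f0) < 3:
        return boundaries
    peak = f0[0]  # running maximum since the previous boundary
    for i in range(1, len(f0) - 1):
        peak = max(peak, f0[i])
        is_valley = f0[i] < f0[i - 1] and f0[i] <= f0[i + 1]
        if is_valley and (peak - f0[i]) / peak >= min_drop:
            boundaries.append(i)
            peak = f0[i]  # start measuring the next fall from this valley
    return boundaries


if __name__ == "__main__":
    # Hypothetical contour with two marked falls, each followed by a rise.
    contour = [120, 130, 125, 100, 115, 128, 126, 124, 95, 110]
    print(find_pitch_drops(contour))
```

A detector of this kind uses no segmental information at all, which is what
makes it a candidate input to grouping strategies of the sort Bever describes.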



Perhaps the most explicit model for the perception of sentences is the
computer program by Winograd (1972) that allows typed communication in English
about a world of colored blocks. The program uses a grammar, based on
Halliday's (1967a, 1970) systemic grammar, that permits a rich interaction
between syntactic and semantic constraints during the process of understanding
an input sentence. Its dynamic use of semantic information provides a valuable
constraint on the possible syntactic parsings of a sentence since semantically
anomalous parsings can be rejected rapidly. For example, an attempt to parse
"He gave the boy plants to water" as if it had the same structure as "He gave
the house plants to charity" would be rejected when the semantic anomaly
between boy and plants was detected. Such constraints could also guide the
perception of spoken language; there will also be additional constraints
imposed by the prosodic information in the spoken sentence. Halliday's (1967b)
own work on the relation between grammar and intonation might serve as a
starting point for this link, particularly as [...] (1971) have found
Halliday's analysis useful in explaining the time it takes subjects to answer
questions asked in a variety of intonations following different introductory
sentences.



CONCLUDING REMARKS 

The experiments reviewed in the middle sections of this chapter have had
some success at describing in information-processing terms possible mechanisms
at early stages in the perception of brief, syllable-length segments of speech.
Their success is to some extent a reflection of that of information-processing
ideas in vision (e.g., Turvey, 1973), since the paradigms or the methodology
have often been taken directly from visual work (e.g., Darwin, Turvey, and
Crowder, 1972; Pisoni and Tash, 1974). However, the success of
information-processing approaches in vision has been largely confined to
"perception at a glance," a term that equally describes the scope of the speech
work. We should ask whether our present enthusiastic pursuits with
information-processing techniques are more likely to secure a golden fleece or
a red herring. Are the tools at our disposal likely to lead to any real
understanding of the true complexity of speech perception? What little we know
about the perception of extended utterances and what little we know about the
way in which the cues to phonological categories change when they occur in
fluent speech both give pause to any simple attempt to relate auditory
"perception at a glance" to the perception of fluent speech. Here, as
elsewhere, the techniques of the experimental psychologist fall short of the
task that faces him. Too often do psychological techniques that promise to
illumine basic perceptual processes end up instead raising problems confined to
their own methodology, which shed comparatively little light on the original
problem.



What is needed is a way of modeling the speech perceptual process that is
at once sufficiently complex to allow the richness of the system to be
adequately represented and yet sufficiently transparent to provide insight.
Computer programs for the synthesis of speech by rule already provide such a
modelling medium for the phonologist (Mattingly, 1971), but interaction between
speech psychologists and those concerned with automatic recognition of speech
has until recently been minimal (Newell, 1971; Hyde, 1972), partly because of
the tremendous technological problems of dealing with auditory signals. But as
these problems are overcome, perhaps we can look forward to program-based
models of the perceptual process being used to stimulate, and in turn be
modified by, observations by psychologists on how the human brain perceives
speech.



REFERENCES 



Ades, A. E. (1973) Some effects of adaptation on speech perception. Quarterly
    Progress Report (Research Laboratory of Electronics, MIT) 111, 121-129.
Ades, A. E. (1974a) How phonetic is selective adaptation? Experiments on
    syllable position and vowel environment. Percept. Psychophys. 16, 61-66.




Ades, A. E. (1974b) Bilateral component in speech perception? J. Acoust. Soc.
    Amer. 56, 610-616.
Ali, L., T. Gallagher, J. Goldstein, and R. Daniloff. (1971) Perception of
    coarticulated nasality. J. Acoust. Soc. Amer. 49, 535-540.
Allen, J. and M. P. Haggard. (1973) Dichotic backward masking of acoustically
    similar vowels. Speech Perception, Report on Speech Research in Progress
    (Psychology Department, The Queen's University, Belfast) Series 2, no. 3,
    35-39.
Atal, B. S. (1974) Towards determining articulatory parameters from the speech
    wave. Paper presented at the International Congress of Acoustics, London.
Bailey, P. (1973) Perceptual adaptation for acoustical features in speech.
    Speech Perception, Report on Speech Research in Progress (Psychology
    Department, The Queen's University, Belfast) Series 2, no. 2, 29-34.
Bansal, R. K. (1966) The intelligibility of Indian English. Unpublished Ph.D.
    thesis, London University.
Bastian, J., P. D. Eimas, and A. M. Liberman. (1961) Identification and
    discrimination of a phonemic contrast induced by a silent interval.
    J. Acoust. Soc. Amer. 33, 842(A).
Bever, T. G. (1970) The cognitive basis for linguistic structures. In
    Cognition and the Development of Language, ed. by J. R. Hayes. (New York:
    Wiley).
Blakemore, C. and F. Campbell. (1969) On the existence of neurons in the human
    visual system selectively sensitive to the orientation and size of retinal
    images. J. Physiol. 203, 237-260.
Blesser, B. (1969) Perception of spectrally rotated speech. Unpublished Ph.D.
    dissertation, Massachusetts Institute of Technology.
Bregman, A. S. and J. Campbell. (1971) Primary auditory stream segregation and
    perception of order in rapid sequences of tones. J. Exp. Psychol. 89,
    244-249.
Broadbent, D. E. (1958) Perception and Communication. (Oxford: Pergamon
    Press).
Broadbent, D. E. and P. Ladefoged. (1957) On the fusion of sounds reaching
    different sense organs. J. Acoust. Soc. Amer. 29, 708-710.
Campbell, F. W. and J. G. Robson. (1968) Application of Fourier analysis to
    the visibility of gratings. J. Physiol. 197, 551-566.
Carlson, R., G. Fant, and B. Granstrom. (1973) Two-formant models, pitch and
    vowel perception. Paper presented at the Symposium on Auditory Analysis
    and Perception of Speech, 21-24 August, Leningrad.
Carlson, R., B. Granstrom, and G. Fant. (1970) Some studies concerning
    perception of isolated vowels. Quarterly Progress and Status Report
    (Speech Transmission Laboratory, Royal Institute of Technology, Stockholm,
    Sweden) QPSR-2/3, 19-35.
Chomsky, N. (1970) Phonology and reading. In Basic Studies on Reading, ed. by
    H. Levin and J. P. Williams. (New York: Basic Books).
Chomsky, N. and M. Halle. (1968) The Sound Pattern of English. (New York:
    Harper & Row).
Christie, W. M., Jr. (1974) Some cues for syllable juncture perception in
    English. J. Acoust. Soc. Amer. 55, 819-821.
Clowes, M. B. (1971) On seeing things. Artificial Intelligence 2, 79-116.
Cole, R. A. and B. Scott. (1974a) The phantom in the phoneme: Invariant cues
    for stop consonants. Percept. Psychophys. 15, 101-107.
Cole, R. A. and B. Scott. (1974b) Toward a theory of speech perception.
    Psychol. Rev. 81, 348-374.






Cooper, F. S., P. C. Delattre, A. M. Liberman, J. M. Borst, and L. J. Gerstman.
    (1952) Some experiments on the perception of synthetic speech sounds.
    J. Acoust. Soc. Amer. 24, 597-606.
Cooper, W. E. (1974) Adaptation of phonetic feature analyzers for place of
    articulation. J. Acoust. Soc. Amer. 56, 617-627.
Cooper, W. E. (1975) Selective adaptation to speech. In Cognitive Theory, ed.
    by F. Restle, R. M. Shiffrin, J. N. Castellan, H. Lindman, and D. B.
    Pisoni. (Hillsdale, N. J.: Lawrence Erlbaum Assoc.).
Cooper, W. E. and S. Blumstein. (1974) A "labial" feature analyzer in speech
    perception. Percept. Psychophys. 15, 591-600.
Corteen, R. S. and B. Wood. (1972) Autonomic responses to shock-associated
    words in an unattended channel. J. Exp. Psychol. 94, 308-313.
Crowder, R. G. (1971) The sound of vowels and consonants in immediate memory.
    J. Verbal Learn. Verbal Behav. 10, 587-596.
Crowder, R. G. (1973) Representation of speech sounds in precategorical
    acoustic storage. J. Exp. Psychol. 98, 14-24.
Crowder, R. G. and J. Morton. (1969) Precategorical acoustic storage (PAS).
    Percept. Psychophys. 5, 365-373.
Cudahy, E. and B. Leshowitz. (1974) Effects of a contralateral interference
    tone on auditory recognition. Percept. Psychophys. 15, 16-20.
Cutler, A. (1975) Rhythmic factors in the determination of perceived stress.
    Paper presented at the 89th Meeting of the Acoustical Society of America,
    8-11 April, Austin, Tex.
Cutting, J. E. (1972) Ear advantage for stops and liquids in initial and final
    position. Haskins Laboratories Status Report on Speech Research SR-31/32,
    57-65.
Daniloff, R. and K. Moll. (1968) Coarticulation of lip-rounding. J. Speech
    Hearing Res. 11, 707-721.
Darwin, C. J. (1969) Auditory perception and cerebral dominance. Unpublished
    Ph.D. thesis, University of Cambridge.
Darwin, C. J. (1971a) Ear differences in the recall of fricatives and vowels.
    Quart. J. Exp. Psychol. 23, 46-62.
Darwin, C. J. (1971b) Dichotic backward masking of complex sounds. Quart. J.
    Exp. Psychol. 23, 386-392.
Darwin, C. J. (1973) Ear differences and hemispheric specialization. In The
    Neurosciences, Third Study Program, ed. by F. O. Schmitt and F. G. Worden.
    (Cambridge, Mass.: MIT Press), pp. 57-63.
Darwin, C. J. and A. D. Baddeley. (1974) Acoustic memory and the perception of
    speech. Cog. Psychol. 6, 41-60.
Darwin, C. J. and S. A. Brady. (1975) Voicing and juncture in stop-lateral
    clusters. Poster presented at the 89th Meeting of the Acoustical Society
    of America, 8-11 April, Austin, Tex.
Darwin, C. J., M. T. Turvey, and R. G. Crowder. (1972) An auditory analogue of
    the Sperling partial report procedure: Evidence for brief auditory
    storage. Cog. Psychol. 3, 255-267.
Delattre, P. C., A. M. Liberman, and F. S. Cooper. (1955) Acoustic loci and
    transitional cues for consonants. J. Acoust. Soc. Amer. 27, 769-773.
Denes, P. (1955) Effect of duration on perception of voicing. J. Acoust. Soc.
    Amer. 27, 761-764.
Dixit, R. P. and P. F. MacNeilage. (1972) Coarticulation of nasality: Evidence
    from Hindi. J. Acoust. Soc. Amer. 52, 131(A).
Eimas, P. D., W. E. Cooper, and J. D. Corbit. (1973) Some properties of
    linguistic feature detectors. Percept. Psychophys. 13, 247-252.






Eimas, P. D. and J. D. Corbit. (1973) Selective adaptation of linguistic
    feature detectors. Cog. Psychol. 4, 99-109.
Evans, E. F. and I. C. Whitfield. (1964) Classification of unit responses in
    the auditory cortex of the unanesthetized and unrestrained cat. J. Physiol.
    (London) 171, 476-493.
Fant, C. G. M. (1960) Acoustic Theory of Speech Production. (The Hague:
    Mouton).
Fant, C. G. M. (1964) Auditory patterns of speech. Quarterly Progress and
    Status Report (Speech Transmission Laboratory, Royal Institute of
    Technology, Stockholm, Sweden) QPSR-3, 16-20.
Fant, G. (1966) A note on vocal tract size factors and nonuniform F-pattern
    scalings. Quarterly Progress and Status Report (Speech Transmission
    Laboratory, Royal Institute of Technology, Stockholm, Sweden) QPSR-4,
    22-30.
Fant, G. (1967) Auditory patterns of speech. In Models for the Perception of
    Speech and Visual Form, ed. by W. Wathen-Dunn. (Cambridge, Mass.: MIT
    Press).
Fant, G., J. Liljencrants, V. Malac, and B. Borovickova. (1970) Perceptual
    evaluation of coarticulation effects. Quarterly Progress and Status Report
    (Speech Transmission Laboratory, Royal Institute of Technology, Stockholm,
    Sweden) QPSR-1, 10-13.
Fletcher, H. (1929) Speech and Hearing. (New York: Van Nostrand).
Foss, D. J. and D. A. Swinney. (1973) On the psychological reality of the
    phoneme: Perception, identification and consciousness. J. Verbal Learn.
    Verbal Behav. 12, 246-257.
Fourcin, A. J. (1968) Speech source inference. IEEE Trans. Audio
    Electroacoust. AU-16, 65-67.
Fry, D. B. (1970) Prosodic phenomena. In Manual of Phonetics, ed. by
    B. Malmberg. (Amsterdam: North-Holland).
Fry, D. B., A. S. Abramson, P. D. Eimas, and A. M. Liberman. (1962) The
    identification and discrimination of synthetic vowels. Lang. Speech 5,
    171-189.
Fujisaki, H. and T. Kawashima. (1968) The influence of various factors on the
    identification and discrimination of synthetic speech sounds. Paper
    presented at the 6th International Congress on Acoustics, August, Tokyo,
    Japan.
Gårding, E. (1967) Internal Juncture in Swedish. Travaux de l'Institut de
    Phonétique de Lund VI. (Lund: Gleerup).
Gazzaniga, M. S. and R. W. Sperry. (1967) Language after section of the
    cerebral commissures. Brain 90, 131-148.
Godfrey, J. J. (1974) Perceptual difficulty and the right-ear advantage for
    vowels. Brain Lang. 1, 323-336.
Gombrich, E. (1960) Art and Illusion. (Princeton, N. J.: Princeton University
    Press).
Gough, P. B. (1972) One second of reading. In Language by Ear and by Eye, ed.
    by J. F. Kavanagh and I. G. Mattingly. (Cambridge, Mass.: MIT Press).
Guzman, A. (1968) Computer recognition of three-dimensional objects in a
    visual scene. MAC Technical Report (Project MAC, MIT) 59.
Hadding-Koch, K. and M. Studdert-Kennedy. (1964) An experimental study of some
    intonation contours. Phonetica 11, 175-185.
Haggard, M. P. (1971) Theoretical issues in speech perception. Speech
    Synthesis and Perception (Psychological Laboratory, University of
    Cambridge) 4.
Haggard, M. P., J. M. Corrigall, and A. E. Legg. (1971) Perceptual factors in
    articulatory defects. Folia Phoniat. 23, 33-40.
Halliday, M. A. K. (1967a) Notes on transitivity and theme in English.
    J. Ling. 3, 37-81 and 4, 179-215.






Halliday, M. A. K. (1967b) Intonation and Grammar in British English. (The
    Hague: Mouton).
Halliday, M. A. K. (1970) Functional diversity in language as seen from a
    consideration of modality and mood in English. Foundations of Language 6,
    322-361.
Harris, K. S. (1958) Cues for the discrimination of American English
    fricatives in spoken syllables. Lang. Speech 1, 1-17.
Holmes, J. N., I. G. Mattingly, and J. N. Shearme. (1964) Speech synthesis by
    rule. Lang. Speech 7, 127-143.
Hubel, D. H. and T. N. Wiesel. (1962) Receptive fields, binocular interaction
    and functional architecture in the cat's visual cortex. J. Physiol.
    (London) 160, 106-154.
Huggins, A. W. F. (1972a) Just noticeable differences for segment duration in
    natural speech. J. Acoust. Soc. Amer. 51, 1270-1278.
Huggins, A. W. F. (1972b) On the perception of temporal phenomena in speech.
    J. Acoust. Soc. Amer. 51, 1279-1290.
Hyde, S. R. (1972) Automatic speech recognition: An initial survey of the
    literature. In Human Communication: A Unified View, ed. by E. E. David
    and P. B. Denes. (New York: McGraw-Hill).
Johnston, J. C. and J. L. McClelland. (1974) Perception of letters in words:
    Seek not and ye shall find. Science 184, 1192-1194.
Kay, R. H. and D. R. Matthews. (1972) On the existence in human auditory
    pathways of channels selectively tuned to the modulation present in
    frequency-modulated tones. J. Physiol. (London) 225, 657-677.
Kirstein, E. F. (1970) Selective listening for temporally staggered dichotic
    CV syllables. J. Acoust. Soc. Amer. 48, 95(A).
Klatt, D. H. (1973) Voice-onset time, frication, and aspiration in
    word-initial consonant clusters. Quarterly Progress Report (Research
    Laboratory of Electronics, MIT) 109, 124-136.
Kozhevnikov, V. A. and L. A. Chistovich. (1965) Speech: Articulation and
    Perception. Translated from Russian. (Washington, D.C.: U. S. Department
    of Commerce, Clearinghouse for Federal Scientific and Technical
    Information).
Kuhn, G. M. (1973) A two-pass procedure for synthesis-by-rule. J. Acoust.
    Soc. Amer. 54, 339(A).
Kuhn, G. M. (1975) On the front cavity resonance, and its possible role in
    perception. Haskins Laboratories Status Report on Speech Research,
    105-116.
Lackner, J. R. and L. M. Goldstein. (1974) The psychological representation of
    sounds. Cognition 2, 279-298.
Ladefoged, P. (1966) The nature of general phonetic theories. Languages and
    Linguistics (Georgetown University), Monograph No. 18, 27-42.
Ladefoged, P. and D. E. Broadbent. (1957) Information conveyed by vowels.
    J. Acoust. Soc. Amer. 29, 98-104.
Ladefoged, P., J. L. DeClerk, G. Papçun, and M. Lindau. (1972) An
    auditory-motor theory of speech production. Working Papers in Phonetics
    (Linguistics Department, University of California at Los Angeles) 22,
    48-75.
Lea, W. A. (1973) An approach to syntactic recognition without phonemics.
    IEEE Trans. Audio Electroacoust. AU-21, 249-258.
Lehiste, I. (1960) An acoustic-phonetic study of internal open juncture.
    Phonetica, Suppl. 5.
Lehiste, I. and L. Shockey. (1972) On the perception of coarticulation effects
    in English VCV syllables. Working Papers in Linguistics (Linguistics
    Department, Ohio State University) 12, 78-86.







Lenneberg, E. H. (1962) Understanding language without ability to speak: A
    case report. J. Abnormal Social Psychol. 65, 419-425.
Liberman, A. M., F. S. Cooper, K. S. Harris, and P. F. MacNeilage. (1962) A
    [...]
Liberman, A. M., F. S. Cooper, D. P. Shankweiler, and M. Studdert-Kennedy.
    (1967) Perception of the speech code. Psychol. Rev. 74, 431-461.

Liberman, A. M., P. C. Delattre, and F. S. Cooper. (1952) The role of selected
stimulus variables in the perception of the unvoiced-stop consonants.
Amer. J. Psychol. 65, 497-516.
Liberman, A. M., P. C. Delattre, and F. S. Cooper. (1958) Some cues for the dis-
tinction between voiced and voiceless stops in initial position. Lang.
Speech 1, 153-167.
Liberman, A. M., P. C. Delattre, F. S. Cooper, and L. J. Gerstman. (1954) The
role of consonant-vowel transitions in the perception of the stop and nasal
consonants. Psychol. Monogr. 68, 8, Whole No. 379.
Liberman, A. M., K. S. Harris, H. S. Hoffman, and B. C. Griffith. (1957) The
discrimination of speech sounds within and across phoneme boundaries.
J. Exp. Psychol. 54, 358-368.
Liberman, A. M., F. Ingemann, L. Lisker, P. C. Delattre, and F. S. Cooper.
(1959) Minimal rules for synthesizing speech. J. Acoust. Soc. Amer. 31,
1490-1499.
Liberman, A. M., I. G. Mattingly, and M. T. Turvey. (1972) Language codes and
memory codes. In Coding Processes in Human Memory, ed. by A. W. Melton
and E. Martin. (New York: Wiley), pp. 307-334.
Licklider, J. C. R. (1951) A duplex theory of pitch perception. Experientia
7, 128-134.
Lieberman, P. (1963) Some effects of semantic and grammatical context on the
production and perception of speech. Lang. Speech 6, 172-187.
Lieberman, P. (1965) On the acoustic basis of the perception of intonation by
linguists. Word 21, 40-54.
Lieberman, P. (1967) Intonation, Perception, and Language. Research Monograph
No. 38. (Cambridge, Mass.: MIT Press).
Lindblom, B. E. F. (1963) Spectrographic study of vowel reduction. J. Acoust.
Soc. Amer. 35, 1773-1781.
Lindblom, B. (1972) Phonetics and the description of language. In Proceedings
of the 7th International Congress of Phonetic Sciences. (The Hague:
Mouton), pp. 63-93.
Lindblom, B. E. F. and M. Studdert-Kennedy. (1967) On the role of formant
transitions in vowel recognition. J. Acoust. Soc. Amer. 42, 830-843.
Lisker, L. (1957a) Closure duration and the intervocalic voiced-voiceless dis-
tinction in English. Language 33, 42-49.
Lisker, L. (1957b) Minimal cues for separating /w,r,l,y/ in intervocalic posi-
tion. Word 13, 256-267.
Lisker, L. (1961) Voicing lag in clusters of stop plus /r/. Speech Research
and Instrumentation, Ninth Final Report, Haskins Laboratories, Appendix
A-II.
Lisker, L. and A. S. Abramson. (1964) A cross-language study of voicing in
initial stops: Acoustical measurements. Word 20, 384-422.
Lisker, L. and A. S. Abramson. (1967) Some effects of context on voice onset
time in English stops. Lang. Speech 10, 1-28.
Lisker, L. and A. S. Abramson. (1970) The voicing dimension: Some experiments
in comparative phonetics. In Proceedings of the 6th International Congress
of Phonetic Sciences, 1967. (Prague: Academia), pp. 563-567.






Lukatela, G. (1973) Pitch determination by adaptive autocorrelation method.
Haskins Laboratories Status Report on Speech Research SR-33, 185-193.
Mackworth, A. K. (1973) Interpreting pictures of polyhedral scenes. Paper
presented at the 3rd International Joint Conference on Artificial Intelli-
gence.
MacNeilage, P. F. and P. Ladefoged. (in press) The production of speech. In
Handbook of Perception, vol. 7, ed. by E. C. Carterette and M. P. Friedman.
(New York: Academic Press).
Malecot, A. (1956) Acoustic cues for nasal consonants. Language 32, 274-284.
Malmberg, B. (1955) The phonetic basis for syllable division. Studia
Linguistica 9, 80-87.
Martin, J. G. (1972) Rhythmic (hierarchical) versus serial structure in speech
and other behavior. Psychol. Rev. 79, 487-509.
Massaro, D. W. (1970) Preperceptual auditory images. J. Exp. Psychol. 85,
411-417.
Mattingly, I. G. (1968) Synthesis by rule of General American English. Ph.D.
dissertation, Yale University. (Issued as Supplement to Haskins Laborator-
ies Status Report on Speech Research.)
Mattingly, I. G. (1971) Synthesis by rule as a tool for phonological research.
Lang. Speech 14, 47-56.
McNeill, D. and L. Lindig. (1973) The perceptual reality of phonemes, sylla-
bles, words, and sentences. J. Verbal Learn. Verbal Behav. 12, 417-430.
Moll, K. L. and R. G. Daniloff. (1971) An investigation of the timing of velar
movements during speech. J. Acoust. Soc. Amer. 50, 678-684.
Morton, J. (1970) A functional model for memory. In Models of Human Memory,
ed. by D. A. Norman. (New York: Academic Press).
Morton, J. and D. E. Broadbent. (1967) Passive versus active recognition
models, or is your homunculus really necessary? In Models for the Percep-
tion of Speech and Visual Form, ed. by W. Wathen-Dunn. (Cambridge, Mass.:
MIT Press).
Murrell, G. A. and J. Morton. (1974) Word recognition and morphemic structure.
J. Exp. Psychol. 102, 963-968.
Neisser, U. (1967) Cognitive Psychology. (New York: Appleton-Century-Crofts).
Newell, A. (1971) Speech-Understanding Systems: Final Report of a Study
Group. (Computer Science Department, Carnegie-Mellon University).
O'Connor, J. D., L. J. Gerstman, A. M. Liberman, P. C. Delattre, and F. S.
Cooper. (1957) Acoustic cues for the perception of initial /w,j,r,l/ in
English. Word 13, 24-43.
Ohman, S. E. G. (1966) Coarticulation in VCV utterances. J. Acoust. Soc.
Amer. 39, 151-168.
Ohman, S. E. G. (1967) Numerical model of coarticulation. J. Acoust. Soc.
Amer. 41, 310-320.
Peterson, G. E. and H. L. Barney. (1952) Control methods used in a study of
the identification of vowels. J. Acoust. Soc. Amer. 24, 175-184.
Pisoni, D. B. (1971) On the nature of categorical perception of speech sounds.
Ph.D. thesis, University of Michigan. (Issued as Supplement to Haskins
Laboratories Status Report on Speech Research.)
Pisoni, D. B. (1972) Perceptual processing time for consonants and vowels.
Haskins Laboratories Status Report on Speech Research SR-31/32, 83-92.
Pisoni, D. B. (1973) Auditory and phonetic memory codes in the discrimination
of consonants and vowels. Percept. Psychophys. 13, 253-260.
Pisoni, D. B. (1975) Dichotic listening and processing phonetic features. In
Cognitive Theory, vol. 1, ed. by F. Restle, R. M. Shiffrin, N. J. Castellan,
and H. Lindman. (Hillsdale, N. J.: Lawrence Erlbaum Assoc.).






Pisoni, D. B. and S. D. McNabb. (1974) Dichotic interactions and phonetic fea-
ture processing. Brain Lang. 1, 351-362.
Pisoni, D. B. and J. Tash. (1974) Reaction times to comparisons within and
across phonetic categories. Percept. Psychophys. 15, 285-290.
Porter, R. J. (1971) The effect of temporal overlap on the perception of
dichotically and monotically presented CV syllables. J. Acoust. Soc. Amer.
50, 129(A).
Porter, R. J., D. P. Shankweiler, and A. M. Liberman. (1969) Differential
effects of binaural time differences on perception of stop consonants and
vowels. Paper presented at the 77th Meeting of the American Psychological
Association, Washington, D.C.
Posner, M. I. and R. F. Mitchell. (1967) Chronometric analysis of classifica-
tion. Psychol. Rev. 74, 392-409.
Rand, T. C. (1971) Vocal tract size normalization in the perception of stop
consonants. Haskins Laboratories Status Report on Speech Research SR-25/26,
141-146.
Rand, T. C. (1974) Dichotic release from masking for speech. J. Acoust. Soc.
Amer. 55, 678-680.
Raphael, L. J. (1972) Preceding vowel duration as a cue to the perception of
the voicing characteristic of word-final consonants in American English.
J. Acoust. Soc. Amer. 51, 1296-1303.
Reicher, G. M. (1969) Perceptual recognition as a function of meaningfulness
of stimulus material. J. Exp. Psychol. 81, 275-280.
Rubin, P., M. T. Turvey, and P. Van Gelder. (in press) Semantic influences on
phonological processing. Haskins Laboratories Status Report on Speech
Research SR-44.
Rubinstein, H. and I. Pollack. (1963) Word predictability and intelligibility.
J. Verbal Learn. Verbal Behav. 2, 147-158.
Savin, H. and T. G. Bever. (1970) The nonperceptual reality of the phoneme.
J. Verbal Learn. Verbal Behav. 9, 295-302.
Sawusch, J. R., D. B. Pisoni, and J. E. Cutting. (1974) Category boundaries
for linguistic and nonlinguistic dimensions of the same stimuli. Paper
presented at the 87th Meeting of the Acoustical Society of America, April,
New York.
Schatz, C. (1954) The role of context in the perception of stops. Language
30, 47-56.
Scholes, R. J. (1971a) Acoustic Cues for Constituent Structure. (The Hague:
Mouton).
Scholes, R. J. (1971b) On the spoken disambiguation of superficially ambiguous
sentences. Lang. Speech 14, 1-11.
Schwartz, M. F. (1967) Transitions in American English /s/ as cues to the
identity of adjacent stop consonants. J. Acoust. Soc. Amer. 42, 897-899.
Schwartz, M. F. (1968) Identification of speaker sex from isolated voiceless
fricatives. J. Acoust. Soc. Amer. 43, 1178-1179.
Selfridge, O. G. and U. Neisser. (1960) Pattern recognition by machine. Sci.
Amer. 203 (Aug.), 60-68.
Shankweiler, D. P., W. Strange, and R. Verbrugge. (in press) Speech and the
problem of perceptual constancy. In Perceiving, Acting and Comprehending:
Toward an Ecological Psychology, ed. by R. Shaw and J. Bransford.
(Hillsdale, N. J.: Lawrence Erlbaum Assoc.). [Also in Haskins Laboratories
Status Report on Speech Research (this issue).]
Shankweiler, D. P. and M. Studdert-Kennedy. (1967) Identification of consonants
and vowels presented to left and right ears. Quart. J. Exp. Psychol. 19,
59-63.




Shearme, N. and J. N. Holmes. (1962) An experimental study of the classifi-
cation of sounds in continuous speech according to their distribution in
the F1-F2 plane. In Proceedings of the 4th International Congress of
Phonetic Sciences, Helsinki, 1961. (The Hague: Mouton).
Small, A. M. (1970) Periodicity pitch. In Foundations of Modern Auditory
Theory, vol. 1, ed. by J. V. Tobias. (New York: Academic Press).
Smith, F. and C. Goodenough. (1971) Effects of context, intonation, and voice
on the reaction time to sentences. Lang. Speech 14, 241-250.
Stevens, K. N. (1972) The quantal nature of speech: Evidence from articula-
tory-acoustic data. In Human Communication: A Unified View, ed. by E. E.
David and P. B. Denes. (New York: McGraw-Hill).
Stevens, K. and A. S. House. (1972) Speech perception. In Foundations of
Modern Auditory Theory, vol. 2, ed. by J. V. Tobias. (New York: Academic
Press).
Stevens, K. and D. Klatt. (1974) Role of formant transitions in the voiced-
voiceless distinction for stops. J. Acoust. Soc. Amer. 55, 653-659.
Stowe, A. N. and H. B. Hampton. (1961) Speech synthesis with pre-recorded syl-
lables and words. J. Acoust. Soc. Amer. 33, 810-811.
Strevens, P. (1960) Spectra of fricative noise. Lang. Speech 3, 32-49.
Studdert-Kennedy, M. (1974) The perception of speech. In Current Trends in
Linguistics, vol. 12: Phonetics, ed. by T. A. Sebeok. (The Hague: Mouton).
Studdert-Kennedy, M. and F. S. Cooper. (1966) High-performance reading machines
for the blind. In Proceedings of the International Conference on Sensory
Devices for the Blind. (London: St. Dunstan's), pp. 317-340.
Studdert-Kennedy, M., D. Shankweiler, and S. Schulman. (1970) Opposed effects
of a delayed channel on perception of dichotically and monotically presented
CV syllables. J. Acoust. Soc. Amer. 48, 599-602.
Summerfield, A. Q. (1974) Toward a detailed model for the perception of voic-
ing contrasts. Speech Perception: Report on Speech Research in Progress
(Psychology Department, The Queen's University, Belfast) Series 2, no. 3,
1-26.
Summerfield, A. Q. and M. P. Haggard. (1974) Perceptual processing of multiple
cues and contexts: Effects of following vowel upon stop consonant voicing.
J. Phonetics 2, 279-295.
Svensson, S-G. (1974) Prosody and grammar in speech perception. MILUS, No. 2.
(Institute of Linguistics, University of Stockholm).
Tartter, V. C. (1975) Selective adaptation of acoustic and phonetic detectors.
Unpublished M.A. thesis, Brown University.
Tash, J. (1974) Selective adaptation of auditory feature detectors in speech
perception. M.A. dissertation, University of Indiana. [Published in
Research on Speech Perception (Department of Psychology, Indiana Univer-
sity), Progress Report No. 1.]
Treisman, A. M. (1960) Contextual cues in selective listening. Quart. J. Exp.
Psychol. 12, 242-248.
Treisman, A. M. and J. G. A. Riley. (1969) Is selective attention selective
perception or selective response? A further test. J. Exp. Psychol. 79,
27-34.
Turvey, M. T. (1973) On peripheral and central processes in vision. Psychol.
Rev. 80, 1-52.
Wang, W. S-Y. and C. Fillmore. (1961) Intrinsic cues and consonant perception.
J. Speech Hearing Res. 4, 130-136.
Warren, R. M. (1968) Verbal transformation effect and auditory perceptual
mechanisms. Psychol. Bull. 70, 261-270.






Warren, R. M. and J. M. Ackroft. (1974) Dichotic verbal transformation: Evi-
dence of separate neural processes for identical stimuli. J. Acoust. Soc.
Amer., Suppl. 56, 54(A).
Wheeler, D. D. (1970) Processes in word recognition. Cog. Psychol. 1, 59-85.
Wightman, F. L. (1973) Pattern-transformation model of pitch. J. Acoust. Soc.
Amer. 54, 407-416.
Wilson, J. P. (1973) Psychoacoustical and neurophysiological aspects of audi-
tory pattern recognition. In The Neurosciences: Third Study Program, ed.
by F. O. Schmitt and F. G. Worden. (Cambridge, Mass.: MIT Press).
Wingfield, A. and J. F. Klein. (1971) Syntactic structure and acoustic pattern
in speech perception. Percept. Psychophys. 9, 23-25.
Winitz, H., M. E. Scheib, and J. A. Reeds. (1972) Identification of stops and
vowels from the burst portion of /p,t,k/ isolated from conversational
speech. J. Acoust. Soc. Amer. 51, 1309.
Winograd, T. (1972) Understanding Natural Language. (New York: Academic
Press).





On the Dynamic Use of Prosody in Speech Perception*

ABSTRACT

Two roles that prosodic variables might play in the perception
of speech are reviewed and two experiments described. Prosody is
seen on the one hand as helping to direct attention to a particular
speaker and to the potentially most informative parts of his speech.
It is also seen as modifying the hypotheses a listener might enter-
tain about a sentence as he listens to it. The use of models of per-
ception taken from natural language understanding programs is sug-
gested as a way to begin modeling this second role of prosody.

There is little doubt that prosodic variables make a significant contribu-
tion to the intelligibility of speech. We lack, however, a suitable framework
for modeling, in perception, the relation between prosodic variables and segmen-
tal information. Part of the reason for this is that many very different types
of information can be conveyed by prosodic variables, and it would be a brave
soul who denied any of them a role in intelligibility. The segmental distinc-
tions carried by normally prosodic dimensions, such as change in pitch as a weak
cue to voicing (Fujimura, 1961; Haggard, Ambler, and Callow, 1970) and lexical
distinctions carried by stress (e.g., see Fry, 1970), are not particularly dif-
ficult to integrate into a scheme for the perception of lexical items from seg-
mental cues. But these aspects of prosody exclude perhaps the bulk of the con-
tribution that prosody makes to the intelligibility of speech. Sentences made
up of concatenated words spoken in isolation are less intelligible than those
spoken fluently (Stowe and Hampton, 1961; Abrams and Bever, 1969), a finding
that is almost paradoxical when one considers (Huggins, 1972) that words excised
from fluent speech are less intelligible than the same words spoken in isolation
(Lieberman, 1963). The resolution of this paradox can be achieved in part by
knowing that coarticulation is no respecter of word boundaries, but it is like-
ly, if not compelling, that the prosodic information present in fluent speech
helps perception of the fluent utterance. When coarticulatory artifacts are
made less likely, prosodic perturbations less drastic than word splicing still
influence intelligibility. When sentences sharing a common five-word portion,
but having major syntactic boundaries at different places within this common
string, are cross-spliced so that an intonation contour inappropriate to the
sentence's syntactic structure is heard, the intelligibility of these mosaic
sentences is lower than that of sentences with normal intonation (Wingfield and
Klein, 1971; Wingfield, 1975). This technique of cross-splicing introduces two
perturbations into the prosody: first, it produces an abrupt change in both pitch
and rhythm at the splice points, and second, it gives an inappropriate placement
of the intonational cues to the syntactic boundary. While both of these changes
may contribute to the reduced intelligibility of the mosaic sentences, the latter
certainly has a specific effect, since subjects' transcriptions of these sentences
include ones with the syntactic boundary at the point suggested by the intonation
contour. This result can be compared with the outcome of other experiments that
have shown that subjects can extract considerable information about the stress 
pattern of speech even when there is virtually no segmental information present, 
either by virtue of its being hummed (Svensson, 1974) or spectrally rotated 
(Blesser, 1969). If indeed information that is a potentially useful indicant of 
syntactic structure can be extracted independently of segmental information, do 
we need to postulate any interaction between segmental and prosodic information 
while the sentence is being perceived? Perhaps Wingfield and Klein's (1971) 
subjects simply reinterpreted the sentences they heard in the light of an inde-
pendently perceived intonation contour. Not very dynamic, but certainly easy to
model!

Life would be more interesting and speech perhaps easier to perceive if 
prosody played a more dynamic role. This paper aims to review the evidence that 
bears on a potentially dynamic role of prosody in the perception of speech and
to present two new fragments of data that contribute to this discussion. The
review will be based on two putative roles of prosodic variables: the role of 
rhythm and pitch contour in both allowing temporal prediction and giving contin- 
uity to an attended channel, and the role of prosodic variables in delimiting 
higher syntactic structures. 

TEMPORAL PATTERNING, PREDICTION, AND ATTENTION

Reaction-time techniques provide a useful way of studying the ongoing pro-
cess of sentence perception. The most extensively used technique, the phoneme-
monitoring task, was introduced by Foss and Lynch (1969). Here, subjects, while
listening to a sentence that they subsequently must recall, have to press a key
whenever they hear a word beginning with a particular phoneme (usually /b/).
Changes in this reaction time have been found to depend on a number of syntactic
and semantic variables. Reaction time is longer when the target word occurs in
a self-embedded than in a right-branching sentence (Foss and Lynch, 1969), in
sentences with deleted rather than intact relative pronouns (Hakes and Cairns,
1970; Hakes and Foss, 1970), in sentences with reduced rather than intact com-
plements (Hakes, 1972), and after high rather than low frequency words (Foss,
1969; at least for adjectives of different frequencies, Cairns and Foss, 1971).
These differences have been interpreted as reflecting changes in the processing
load imposed by the sentence (cf. Aaronson, 1968) and should not be thought of
as reflecting directly the extraction of segmental information. The conscious
access to the phonemic level that this task requires is made after (or at least
is influenced by) decisions about what the word containing the target is. Reac-
tion times are shorter if the target starts a word rather than a nonword (Rubin,
Turvey, and van Gelder, in press; see also Foss and Swinney, 1973; Treisman and
Tuxworth, 1974, for discussion of related points).

The phoneme monitoring task has been used in two contexts that relate to
the use of prosodic information. An experiment by Cutler and Foss (1973) pro-
vides a convenient starting point. Following an earlier finding that reaction
time to an initial stop consonant was faster when it appeared at the beginning
of a content than of a function word, they showed that this difference was
attributable to differences in the stress with which function and content words
are normally spoken. When stress and lexical category were varied independent-
ly, the faster reaction time accompanied the stressed word. Although unstressed
function words gave slower reaction times than unstressed content words, they
felt this could be owing to different degrees of stress.

This experiment alone can support a number of interpretations: (1) stressed
words may be articulated more precisely so that there is more segmental informa-
tion available, (2) stressed words may be processed more efficiently on account
of their intrinsic stress, (3) stressed words may be processed more efficient-
ly because the preceding stress pattern suggests that they will be stressed.
This last hypothesis is compatible with recent comments on the use of rhythm by
Martin (1972).

The first two interpretations attribute the faster reaction time for
stressed words to factors intrinsic to the word itself, and can now be dismissed
with some confidence. Shields, McHugh, and Martin (1974) measured reaction time
to nonsense disyllables beginning with a target phoneme. The disyllable was
pronounced with the stress either on the first or the second syllable. When the
word occurred as part of a fluent sentence, subjects were faster to a target in
a stressed syllable than in an unstressed one, provided that the target did not
occur too close to the end of the sentence. However, when the same target words
were spliced out and presented in isolated list form, the differences between
stressed and unstressed syllables disappeared. The curious lack of difference
when the targets occur at the end of the sentence may be simply a floor effect,
since reaction times decreased throughout the sentence for both types of sylla-
bles. This study is complemented by a recent experiment by Cutler (cited in
Cutler, 1975) in which target words were spliced into a sentence context whose
prosody suggested that the target word would or would not be stressed. She
found that reaction time was faster when subjects expected a stressed word than
when they did not. The stress difference thus seems to be independent of the
intrinsic stress of a word and depends rather on the preceding prosodic pattern,
which allows the subject to anticipate the forthcoming stressed word.

If some form of anticipation is the cause of the stressed-syllable advan-
tage, then we might expect two further effects: some time should be necessary
for subjects to get the rhythm of the utterance on which prosodic predictions
might be based, and local disturbance of the rhythm should upset the stressed-
syllable advantage. There is evidence for both. Aaronson (1968) asked her sub-
jects to monitor a list of digits, spoken at a constant rate, for the occurrence
of a target digit. There was a decrease of about 100 msec in reaction time to the
target over the first three serial positions. This occurred whether subjects
subsequently had to recall the list of digits or not. Reaction times remained
steady in subsequent serial positions for subjects who had only to monitor the
lists, but they increased after about the third item until the end of the list
for subjects who had to recall the lists as well as monitor. We will return to
this latter finding below; for the present it is enough to note that the initial
decrease in reaction time may well reflect subjects' accommodating to the rhythm
of the list. Cutler (1975) showed that local temporal disturbances influence
phoneme-monitoring reaction times. She found that inserting a quarter-second
period of silence before a pair of monosyllabic words embedded in a fluently
spoken sentence influenced the reaction time to a phoneme target at the begin-
ning of either of the words. The reaction time was slowed to the second word,
which was initially spoken with stress, and quickened to the unstressed first
word. Cutler failed to find any advantage for the stressed word over the un-
stressed when the silent interval was absent; this may possibly be attributable
to the target's position in the sentence (cf. Shields et al., 1974).

The picture emerging from these studies, then, is of subjects' attention be-
ing allocated preferentially toward the stressed syllables in a sentence (Shields
et al., 1974). This is made easier by the timing of speech (at least in a
stress-timed language) being determined by stressed syllables. Allen (1972),
for example, showed that subjects can tap with less variability to stressed syl-
lables than to unstressed (even when the sentence in which they were embedded
was heard repeatedly), and Huggins (1972) finds subjects to be more tolerant of
timing distortions that preserve stressed-syllable separations. As Cutler and
Foss (1973) point out, allowing processing to be directed toward the stressed
parts of the sentence allows the focus of the speaker's sentence to control the
listener's perception. However, we might note in passing that it is not yet
clear how much the results of these experiments depend on using a temporal response
measure. The task of targeting for a phoneme might cause an undue advantage to
those portions of the speech stream that are temporally predictable. It would
strengthen the case if other measures, such as perhaps the detectability of fea-
ture substitutions, showed a similar increase in stressed syllables. Some sup-
port comes from an experiment by Dooling (1974). He found that subjects' imme-
diate recall of sentences heard in noise was improved if they had previously
been exposed to sentences with a similar stress pattern. His results also
showed that repetition of stress pattern was much more effective at improving
performance than was repetition of syntactic structure, when these two variables
were independently manipulated.

Although Martin's (1972) theorizing emphasizes the predictive nature of 
rhythm, others have shown that pitch contour can play a similar role. The phe- 
nomenon of "primary auditory stream segregation" (Bregman and Campbell, 1971) 
illustrates this, and again, like Martin's work, embraces both speech and music. 
A random sequence of six notes (three high, three low), when played rapidly, 
will perceptually segment into a high and a low tune, despite lack of any great-
er rhythmic cohesion within than between tunes. The analogy with speech perhaps
lies both in the use of frequency continuity of formants to help in their track-
ing (cf. Dorman, Cutting, and Raphael, 1975) and in the use of continuity of
pitch to help in attending to one voice against competing sounds. This last 
point has been pursued in an experiment carried out in 1973 by Darwin and Davina 
Simmonds. While this experiment was motivated by the Bregman and Campbell 
(1971) findings, it does not in fact distinguish between rhythm and pitch, but 
looks at the influence of prosodic variables in general on the ability to attend 
to a particular speech source.

The technique of shadowing, although complex for the subject, provides a
valuable way of studying the process of selective attention to a particular
speaker. Treisman (1960) asked subjects to shadow a passage of continuous
speech led to one ear and to ignore a similar passage led to the other. At some
point during the passage the two channels were switched so that the passage that
had been shadowed continued in the other ear. She found that subjects occasion-
ally gave, as part of their shadowing response, words that had in fact occurred
immediately after the switch on the ear they had been instructed to ignore, and
that these intrusions were more common the more redundant the passages. Inas-
much as this effect has any sensory basis, we can ask whether the tendency for
subjects to shadow words from the unattended ear is being determined by some
semantic/syntactic priming, as Treisman maintained, or rather whether the con-
tinuity in rhythm and pitch across the ears at the switch point is sufficient to
cause a momentary change in the ear from which the attended auditory input is
drawn.

Our experiment was basically a repetition of Treisman's but with indepen-
dent manipulation of prosodic and semantic factors. Pairs of passages of about
50 words each were selected from short stories by H. E. Bates. From each pair
of passages, four recordings were made by the same female speaker. Two were of
the original passages, and two were made by reading the first part of one pas-
sage followed smoothly by the second part of the other. The switch point be-
tween the passages was later than halfway through each passage and was always
prior to a word beginning with a stop consonant (to facilitate subsequent splic-
ing) but was otherwise placed at random (although never at a major clause or
sentence boundary). From these four original recordings, four different dichotic
conditions were made: a normal condition in which the two original passages
were paired together (aligned to give simultaneity of the stop closures); a
semantic change condition, made by pairing together the other two original re-
cordings; an intonation change condition, made by switching the latter pair of
passages after the stop closure; and a condition in which both semantics and in-
tonation changed, made by switching the two original passages after the stop
closure. The first and last of these conditions repeat Treisman's (1960) condi-
tions; the other two vary independently semantic and intonational continuity on
the attended ear. All four conditions were made up with the same number of
splicings and rerecordings to prevent the preparation of the tapes selectively
introducing artifacts into the experiment.
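
The way the four conditions follow from the four recordings can be sketched in a few lines of Python. This is only an illustration of the design as described above, not the procedure used to prepare the tapes. The labels "AA", "BB", "AB", and "BA" for the recordings, and the names `Half`, `make_conditions`, and `describe`, are assumptions introduced here. Only the attended ear is modeled; the unattended ear carries the complementary material.

```python
from typing import NamedTuple


class Half(NamedTuple):
    text: str     # which passage's words are heard ("A" or "B")
    reading: str  # which continuous reading the speech was taken from


# Hypothetical labels: the passage text carried by each recording
# before and after the switch point.
RECORDINGS = {
    "AA": ("A", "A"),  # original passage A
    "BB": ("B", "B"),  # original passage B
    "AB": ("A", "B"),  # first part of A read smoothly into second part of B
    "BA": ("B", "A"),  # first part of B read smoothly into second part of A
}


def channel(first_rec, second_rec):
    """Attended-ear signal: first half from one recording, second from another."""
    return (Half(RECORDINGS[first_rec][0], first_rec),
            Half(RECORDINGS[second_rec][1], second_rec))


def make_conditions():
    return {
        "no_break": channel("AA", "AA"),    # original recording throughout
        "semantic": channel("AB", "AB"),    # crossed recording, no switch
        "intonation": channel("AB", "BA"),  # crossed recordings, switched
        "both": channel("AA", "BB"),        # original recordings, switched
    }


def describe(attended):
    """Report which kinds of continuity are broken at the switch point."""
    before, after = attended
    return {
        "semantic_break": before.text != after.text,
        "intonation_break": before.reading != after.reading,
    }
```

Under these assumptions, `describe(make_conditions()["intonation"])` reports a break in intonation but not in semantics, and the "semantic" condition reports the reverse.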

Each of 14 subjects was instructed to shadow the passage on one ear with as
little lag as possible between hearing the speech and saying it. They were told
to try not to chunk the speech into phrases before speaking it, and were given
five practice trials on similar dichotic passages that did not have any switches.
They then took eight different dichotic passages with each of the four experi-
mental conditions appearing twice in a counterbalanced order.



Two types of errors are distinguished in the results. A particular trial
is classed as having an omission error if the subject misses at least two words
over the break point, and it is classed as an intrusion error if the subject
shadowed any words from the unattended ear over the break point. Table 1 shows
the distribution of errors over the four different types of break.
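
The scoring rule just described can be written out as a small function. The sketch below is an illustration under stated assumptions, not the scoring procedure actually used: the function name `score_trial`, the idea of a fixed window of words "over the break point," and the representation of a response as a list of words are all hypothetical. The text does not say whether one trial could count as both an omission and an intrusion; this sketch simply reports the two flags separately.

```python
def score_trial(attended_words, unattended_words, response_words):
    """Classify one shadowing trial from the words around the break point.

    attended_words, unattended_words: words presented to each ear in a
    (hypothetical) scoring window spanning the break point.
    response_words: the words the subject spoke for that window.
    """
    attended = set(attended_words)
    unattended = set(unattended_words)
    spoken = set(response_words)

    # Attended words that never appear in the response.
    missed = [w for w in attended_words if w not in spoken]
    # Spoken words that can only have come from the unattended ear.
    intruded = [w for w in response_words
                if w in unattended and w not in attended]

    return {
        "omission": len(missed) >= 2,     # at least two attended words missed
        "intrusion": len(intruded) >= 1,  # any word taken from the other ear
    }
```

For example, with attended words `["the", "dog", "ran", "home"]`, unattended words `["a", "cat", "sat", "down"]`, and the response `["the", "dog", "sat", "down"]`, both flags are set: two attended words are missing and two words come from the unattended ear.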
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TABLE 1: Percentage of trials on which omission or intrusion errors occurred 

for different types of discontinuity and for subjects who ^either shad- 
owed continuously or "chunked" their responses. 

V . Type of discontinuity on ehadbwed ear 



Chunked (5 ^s) 
Continuous (9 Ss) 
TOTAL (14 Ss) 





No break 


Intonation 


Semantic 


Both 


Intrusions 
♦ 


0 


'20.0 


■■ Q'S- 


0 


Omissions 


10.0 


40.0 


90.0'/ 


50.0 


Intrusions 


0 ' 


77.8 


5.6^ 


55.6 


Omissions 


0 


11. 1 


55.6t ' 


22.2 


Intrusions 


0 


57.1 




35.7 


Omissions 


3.6 


21.4 


67.8 


32.1 



In the results for all the subjects combined, there is a significant change
in the number of intrusion and omission errors across the three experimental
conditions (p<.01 and <.05, respectively). This variation is due to there being
significantly more intrusion errors in both the conditions with intonation
changed than when only the semantics changed (p<.05), and to there being
significantly more omission errors in the condition with the semantics alone
changed than in the other two experimental conditions (p<.01). A closer look at
the subjects responsible for the various types of errors showed, in addition,
that very few intrusion errors were made by the five subjects who, despite their
instructions, chunked their shadowing responses. Omission errors, by contrast,
were distributed more evenly between the subjects, but tended to be made more by
the subjects who chunked their responses. In other words, for subjects who
shadow continuously, an abrupt change of the intonation contour between the ears
causes intrusion errors to occur, while an abrupt change in the semantic content
of the message, but without any switch in the intonation contour, gives omission
errors. Subjects who chunk their shadowing responses likewise produce a large
number of omission errors on semantic-switch trials, but show a different pattern
from the subjects who shadow continuously on the intonation-switch trials, giving
more omission than intrusion errors.
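
The TOTAL rows of Table 1 can be checked against the two subject groups. The
sketch below assumes that each percentage is a proportion of trials and that
each subject contributed two trials per condition (as the procedure states),
giving 10 trials per condition for the five chunkers and 18 for the nine
continuous shadowers. The names `TRIALS`, `TABLE`, and `pooled_percentage` are
illustrative only, and the semantic-break intrusion cells are left out because
they are not legible in the table.

```python
# Trials per condition: each subject met every condition twice (assumption
# taken from the procedure: eight passages, four conditions, each twice).
TRIALS = {"chunked": 5 * 2, "continuous": 9 * 2}

# Percentages transcribed from the legible cells of Table 1.
TABLE = {
    "intrusions": {
        "no break": {"chunked": 0.0, "continuous": 0.0},
        "intonation": {"chunked": 20.0, "continuous": 77.8},
        "both": {"chunked": 0.0, "continuous": 55.6},
    },
    "omissions": {
        "no break": {"chunked": 10.0, "continuous": 0.0},
        "intonation": {"chunked": 40.0, "continuous": 11.1},
        "semantic": {"chunked": 90.0, "continuous": 55.6},
        "both": {"chunked": 50.0, "continuous": 22.2},
    },
}


def pooled_percentage(cell):
    """Convert each group's percentage back to a trial count, then pool."""
    errors = sum(round(cell[group] / 100 * TRIALS[group]) for group in TRIALS)
    return round(100 * errors / sum(TRIALS.values()), 1)


for error_type, row in TABLE.items():
    for condition, cell in row.items():
        print(error_type, condition, pooled_percentage(cell))
```

Under these assumptions the pooled values agree with the printed TOTAL rows,
except that omissions on semantic breaks come out at 67.9 where the table
prints 67.8, which may simply reflect truncation rather than rounding.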

Clearly, the subjects who chunk their responses are much more sensitive to
the semantic constraints on the message they are shadowing, in that they show
omission errors when there is a semantic discontinuity on the attended ear.
Nevertheless, they do not show any tendency for intrusions to occur from the
other ear when only the semantic information switches ears. Although the
subjects who shadow continuously are disrupted less by the semantic discontinuity
than their chunking colleagues, they too show no tendency for intrusions to occur
from the semantically appropriate unattended ear. Rather than assuming that
intrusion errors are caused by some semantic priming of a contextually likely
word, it seems, at least in this experiment, that they are occurring only when
there is a continuity of intonation across the two ears. For a brief time after
a switch in intonation has occurred, the continuity of the intonation contour
overrides the ear of entry as a criterion for selecting the speech that needs to
be attended. If subjects are shadowing continuously, this leads to intrusion
errors from the ostensibly unattended ear. But if the subject is chunking his
response, he has time between hearing the potentially intrusive words and making
his response to omit the words from the inappropriate ear. This, however, causes
omission errors, particularly when there is no semantic continuity on the
attended ear to help him retrieve the words that occurred there immediately
after the break.

This finding, that prosody helps a listener to attend to a particular
speaker, complements the work reviewed earlier on the predictive use of prosody
in anticipating stress. Both types of experiment emphasize the role prosody
plays in controlling which parts of the speech stream are attended to, whether
the selection is made temporally or spatially.

PROSODY AND THE STRUCTURES OF SPEECH

As well as allowing anticipation of the speech stream, prosodic factors
doubtless also play a role in segmenting speech into higher-order structures.
One rather simple way in which prosody delimits complex structures in speech is
through pauses. Goldman-Eisler (1968) claims that even at its most fluent,
over two-thirds of speech consists of utterances of six words or less. Pauses
virtually never appear within a word, and their length is determined by the type
of syntactic juncture that they occur in. They tend to be longer at the end of
sentences than at the beginning of subordinate clauses, and before a subordinate
clause their length is determined by the type of clause (Goldman-Eisler, 1972).
Pauses have been used as experimental variables in studying memory for lists of
items that do not naturally fall into larger structures, and we might get some
insight into their use in natural language by looking for a moment at their
effect in unnatural situations.

A brief period of silence after an item in the middle of a list of digits
provides an opportunity for rehearsal of the preceding group of digits (Kahneman,
Onuska, and Wolman, 1968; Ryan, 1969; Kahneman and Wright, 1971), as well
as allowing the additional information from an auditory memory to be utilized
(Crowder and Morton, 1969). Merely indicating by instructions or interposed
tones in a regular list how that list should be grouped gives substantially
poorer overall performance and a reduced scalloping of the serial recall curve
(Ryan, 1969). This is perhaps because less capacity can be allocated to
rehearsal when there are competing perceptual demands. As well as delimiting
groups of items for rehearsal, pauses also serve to delimit larger structures for
coding in long-term memory. The cumulative learning that occurs when supraspan
strings of items are repeated occasionally throughout a short-term memory
experiment (Hebb, 1961) is reduced when these strings are grouped differently by
pauses (Bower and Winzenz, 1969), but this reduction only seems to occur when the
items that compose the string are amenable to some higher-level recoding
(Laughery and Spector, 1972).

Coming closer to real speech, reading a list of nonsense syllables in a
natural intonation can increase the number of syllables remembered immediately
after, provided the syllables contain additional bound morpheme clues to the
intended syntactic structure (O'Connell, Turner, and Onuska, 1968). Neither
intonation nor the presence of bound morphemes alone is sufficient to increase
the number of syllables recalled. Incidentally, both these additional cues also
seem to be necessary to procure a right-ear advantage (Zurif and Sait, 1970;
Zurif and Mendelsohn, 1972; Zurif, 1974).

It is tempting to extrapolate from these experiments to a view of clause or
sentence perception that delays syntactic processing of a clausal unit until a
clause boundary has been reached, so that the end of the clause is, like the
pause in a string of digits, a time of great mental effort (cf. Kahneman et al.,
1968; Wright and Kahneman, 1971), but there is some evidence against this. As
we remarked earlier, Aaronson (1968) found that reaction time to a target digit
in a string that the subject had to remember increased toward the end of the
list, reflecting the increased memory load. By contrast, there is very little
evidence that a similar increase in reaction time occurs for a phoneme target in
a sentence. Foss and Lynch (1969) did find such an increase, but all relevant
subsequent studies have found a decrease in reaction time throughout the
sentence (Foss, 1969; Hakes and Foss, 1970; Shields et al., 1974). Similarly,
reaction time to a click played at the end of a clause decreases with increasing
length of the clause, while reaction time to a click presented between clauses
shows no consistent increase with length of the first constituent (Abrams and
Bever, 1969). Nor is reaction time to a click consistently faster at the
beginning of the second clause than in the clause break (Abrams and Bever, 1969;
but see Bever and Hurtig, 1975). It is clear, then, that the constituent words of
a clause are not being held in memory like so many digits. Something much more
dynamic is happening. But what? Perhaps the best type of model to use here
would be one based on Winograd's (1972) integrated language understanding
system. In this program both syntactic and semantic information is used, as each
successive word in a sentence is encountered, to construct procedures that can
subsequently be used to take "action" on the sentence. The end of a clause does
not in this scheme imply a particularly energetic syntactic activity on the part
of the computer, provided the syntactic organization that it had presumed during
perception of the preceding word string is compatible with the string's ending.
Incompatibility requires backtracking, but the program's use of semantic
constraints for eliminating alternative syntactic structures as it goes along
reduces the likelihood of this being necessary. If these sorts of syntactic and
semantic constraints are also being used dynamically in natural perception, then
we might suppose that prosody is used in a similarly dynamic and interactive way
to guide the search toward an appropriate syntactic organization of the sentence.

Some of the results of an experiment run recently by John Capitman in
collaboration with myself and Susan Brady bear a little on these issues. Our
experiment was designed primarily to be a replication of Wingfield and Klein's
(1971) interesting finding that inappropriate intonation contours impair
immediate recall of a sentence.

EXPERIMENT II 

The sentences we used in this experiment were similar to those used by
Wingfield and Klein (1971). Each sentence that the subjects heard was one
member of a group of four sentences derived from pairs of sentences sharing a
common string of words. Two of the sentences in the group of four were the
originals with appropriate intonation; the other two were generated by
cross-splicing the common string of words between these two sentences. This
string always started with a stop consonant, to facilitate silent splicing, and
continued to the end of the sentence to reduce the rhythmic discontinuity that
splicing can introduce. The sentences were between 10 and 14 words long and the
common string was between [illegible] and 10 words long. Unlike the Wingfield
and Klein material, the major syntactic boundary in each sentence was both
preceded and followed by at least two words of shared material; this ensured
that the intonationally suggested boundary in the mosaic sentences generated by
cross-splicing was also preceded and followed by at least two words of shared
material. As in the Wingfield and Klein experiment, the sentences were presented
monaurally and the ear of presentation was switched within the sentence. This
switch was always made during a stop closure to prevent extraneous clicks. The
switch point always occurred within the shared material and was never at the
intonationally suggested boundary, but was otherwise unconstrained. The
recordings were made on the Haskins Laboratories pulse-code-modulation (PCM)
facility.

Each of 24 subjects heard one sentence of the four generated from each of
the 11 pairs in an order that counterbalanced conditions (normal versus
cross-spliced). Subjects were instructed to write down each sentence as soon as
they had heard it. In order to make the task more difficult, they were told to
listen for the occurrence of stop consonants and to circle these in their
answer. Their performance on this task was ignored.


On average, 55.5 percent of the normally intonated (unspliced) sentences
were recalled correctly, while only 44.5 percent of the cross-spliced sentences
were recalled correctly. In examining the errors that subjects made, four types
of error are distinguished:

1. Omission errors: Each word omitted, whether or not this omission
   changed any other aspect of the sentence, was scored as an omission
   error.

2. Lexical errors: Each changed word in a string that did not affect
   the primary grammatical relations (the truth conditions of the
   sentence) was scored as a lexical error. Changes in tense, mood,
   and aspect were also counted as lexical errors.

3. Syntactic errors: Each change in any segment of a string that
   resulted in a change in the grammatical relations or the truth
   conditions of the sentence was scored as a syntactic error (except
   for errors of tense, mood, and aspect). No attempt was made to
   distinguish those changes that seemed to conform to the intonation
   pattern from those that did not.

One other error type was defined that depended on the position of the error
but not on its type:

4. IB2 errors: [This score applied only to the cross-spliced
   sentences.] Each error of any type occurring within one word on
   either side of the intonationally suggested boundary was scored
   as an IB2 error.
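
The position-based IB2 score can be sketched as a simple window count. This is
only an illustration under stated assumptions: word positions are 0-based, the
boundary is taken to fall between word `boundary_index - 1` and word
`boundary_index`, and "within one word on either side" is read as those two
words. The function name and the example positions are hypothetical.

```python
def count_ib2_errors(error_positions, boundary_index):
    """Count errors within one word on either side of the boundary.

    error_positions: 0-based indices of words scored as errors of any type.
    boundary_index: the intonationally suggested boundary is assumed to fall
        between word (boundary_index - 1) and word boundary_index.
    """
    window = {boundary_index - 1, boundary_index}
    return sum(1 for position in error_positions if position in window)


# Hypothetical response: errors on words 2, 5, 6 and 9, boundary before word 6.
example_count = count_ib2_errors([2, 5, 6, 9], boundary_index=6)
print(example_count)  # 2
```

Only the errors on the two words flanking the boundary are counted; errors
elsewhere in the sentence contribute to the first three scores but not to IB2.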

Summing together all the first three types of errors (omission, lexical,
syntactic), Capitman found significantly more errors on the cross-spliced than
on the normally intonated sentences. This difference is significant by a
Wilcoxon test both across subjects (p<.01) and across sentences (p<.02).
Moreover, this increase in errors is due to an increase in the omission (p<.02
by subject; p<.05 by sentence) and syntactic errors (p<.01 by both subject and
sentence), but not to an increase in lexical errors. An exception is seen in
three of the sentence pairs that show a decrease in syntactic errors in the
crossed-intonation condition, though still an increase in omission errors. All
of these sentences have an "If . . . then . . ." construction, which may impose
greater syntactic or semantic constraints on the subjects' responses, causing
them to give up on their response rather than change the truth conditions of the
sentence.
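
For readers unfamiliar with the Wilcoxon test mentioned above, the following
sketch computes the signed-rank sums for paired scores. It assumes the
matched-pairs (signed-rank) form of the test, which the text does not specify,
and the error counts are invented purely for illustration; they are not the
experiment's data. Zero differences are dropped and tied absolute differences
share their average rank.

```python
def signed_rank_sums(normal, spliced):
    """Return (W_plus, W_minus) for paired scores, spliced minus normal."""
    diffs = [s - n for n, s in zip(normal, spliced) if s != n]
    ordered = sorted(diffs, key=abs)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        # Positions i+1 .. j (1-based) are tied, so give them their mean rank.
        ranks[abs(ordered[i])] = (i + 1 + j) / 2
        i = j
    w_plus = sum(ranks[abs(d)] for d in diffs if d > 0)
    w_minus = sum(ranks[abs(d)] for d in diffs if d < 0)
    return w_plus, w_minus


# Hypothetical per-subject error totals (NOT the data from this experiment).
hypothetical_normal = [2, 1, 3, 0, 2, 1]
hypothetical_spliced = [4, 1, 2, 3, 5, 3]
print(signed_rank_sums(hypothetical_normal, hypothetical_spliced))
```

The smaller of the two sums is then compared with a critical value for the
number of nonzero pairs; that lookup is omitted here.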

Although these general results confirm Wingfield and Klein's (1971)
findings on intelligibility with a different set of sentences, they do not get us
much nearer to the question of how the inappropriate intonation contour
influences recall. Inspection of the IB2 errors does bear on this question. When
listening to the crossed-intonation sentences, subjects make more errors within
one word on either side of the intonationally suggested boundary when this
boundary precedes the major syntactic boundary than when it follows it (p<.025
by subject; p<.05 by sentence). This is not owing to subjects' making more errors
earlier in the sentence, since the reverse tends to be the case. This asymmetry
in the disruptive effects of an inappropriate intonational cue to a clause
boundary is what one might expect on the hypothesis that intonational
information is being used dynamically to restrict syntactic hypotheses as the
subject listens to the sentence. If prosody were being used solely to restrict
alternatives after initial perception of the sentence, it is difficult to see
why the relative positions of the syntactic and intonational boundaries should
matter. On a more dynamic view, it is possible that the inappropriate
intonational boundary leads to backtracking throughout the previous incomplete
clause; this clause will be much longer, and so the effects of backtracking will
be more disruptive, when the incomplete clause is the first one of the sentence.
Hence, having the intonational boundary before the syntactic boundary will be
more disruptive than having it after it. Looking at errors only in the region of
the intonation boundary appears to be a more sensitive measure than looking at
errors over the whole sentence, since there did not appear to be any significant
difference in errors over the sentence as a whole as a function of these
relative locations. This perhaps suggests that backtracking affects perception
of the part of the sentence that is being actively perceived when it occurs. But
we cannot entirely rule out the possibility that this effect is being caused in
part by the rhythmic discontinuity that cross-splicing introduces immediately
after the breakpoint. Although these conclusions are extremely speculative and
the evidence on which they are based is not strong, we feel that the questions
raised by this sort of approach are extremely interesting, and we hope to pursue
them in subsequent experiments.

In summary, this paper has tried to draw attention to some of the possible
dynamic roles that prosody might play in the perception of speech: the rhythmic
and melodic aspects of speech may allow the listener to predict when potentially
important speech material will arrive (and perhaps allow him to allocate his
processing capacity accordingly); they may also allow him to attend selectively
to one voice among many more readily. In addition, prosody undoubtedly plays a
role in delimiting higher-order structures in speech, and the suggestion is made
here that this is done by dynamically modifying the hypotheses the listener
entertains while listening to a sentence. While we neither pretend that this
exhausts the uses of prosody, nor believe that the study of isolated sentences is
the most appropriate way to study a variable that depends so much on
intersentence context (see Smith and Goodenough, 1971, for an interesting example
of intonation interacting with context in a perceptual task), we do hope that the
issues [illegible] here will help to stimulate work on an unduly neglected area
of speech perception.

Speech and the Problem of Perceptual Constancy*

Donald Shankweiler,+ Winifred Strange,++ and Robert Verbrugge++

ABSTRACT

Speech signals are intrinsically variable for many reasons. In
this paper we consider the implications of variability for a theory
of vowel perception. Current theories of the vowel emphasize the
relational nature of the acoustic cues, since no absolute values of
formant frequencies could unambiguously distinguish vowels produced
by different talkers and in different phonetic contexts. It has been
assumed that the perceptual process of vowel identification includes
a normalization stage whereby the listener calibrates his perceptual
apparatus for each talker, according to some reference derived from
preceding utterances by that talker. We have been unable to obtain
evidence for such a perceptual mechanism. Theories of vowel
perception have failed to give due weight to the richness of the natural
speech signal. We attempt to show why the invariant acoustic
information that specifies a vowel cannot be found in a temporal cross
section, but can only be specified over time.

Speech perception is not ordinarily included in the body of phenomena and
theory that convention defines as the psychology of perception. Yet the problem
of how the perceptual categories of speech are specified in the acoustic signal
is a primary example of the problem of perceptual constancy. In spite of its
neglect by psychology, speech and its signal have been intensively studied by
members of other disciplines. We think that some of the results and puzzles
generated by this research are relevant to the concerns of our colleagues whose
primary interests are in other facets of human cognition.

Speech perception at any level involves classification. The classificatory
step is assumed whenever we move beyond a purely physical (acoustic) description
of speech to a psychophysical description in terms of perceptual units. Unlike
certain problems in traditional psychophysics in which the choice of units may
be arbitrary, there is wide consensus about what the units of perception are in
the case of speech. This consensus is the product of centuries of linguistic
investigation, during which many attempts have been made to isolate the various
levels and units that constitute our perception of speech. Viewed in terms of
structure, speech is a hierarchical system that manifests what Hockett (1958)
called a "duality of patterning": it employs both meaningful and meaningless
units. Morphemes (or, roughly speaking, words) are the smallest of the
meaningful units. In all languages morphemes have an internal structure composed
of smaller meaningless segments, the phonemes (Bloomfield, 1933). Since the
communication of meanings ultimately rests on a foundation of phonemic structure,
a basic part of the task of understanding how speech is perceived is to discover
the conditions for the perception of phonemic categories.1 For present purposes,
we shall ignore meaning and concern ourselves only with the phonemic message,
that is, with the perception of syllables and their phoneme segments, familiar
to us as the consonants and vowels.



In speech, as in handwriting, no two "signatures" are alike. In generating
the "same" phoneme, different speakers do not produce sounds that are
acoustically the same. Indeed, the same signal is never exactly repeated by the
same speaker. In perceiving speech, as in identifying objects, we ordinarily
regard only those distinctions that are critical, ignoring those that are merely
incidental. While no one would deny that speech signals are intrinsically
variable for many reasons, the implications of this variability for perception
have not been widely appreciated. In brief, they constitute a major problem in
perceptual constancy.

1

The phoneme is the minimal unit by which perceivers differentiate utterances.
For example, the word bad has three phonemic segments, /b/, /æ/, and /d/,
that differentiate it from such words as dad, bed, and bat. In different
utterances, a phoneme may be realized acoustically in different ways; linguists
call these variants "phones." For example, the final /t/ in bat might be
either released (acoustically, a pause followed by a burst) or unreleased (no
burst). The class of phones is potentially infinite, and it is arguable
whether phones (however defined) are natural perceptual units. Our emphasis in
this paper is on how the identity of a phoneme is perceived despite variations
in its acoustic form.

THE PERCEIVING MACHINE AND THE ONE-TO-ONE PROBLEM

Although our concern is with how the human perceptual system works, it may
help us to bring this problem into focus if we consider how a machine might
proceed in recovering a string of phonemic segments. Consider, for example, the
problems to be solved in designing a voice-operated typewriter. The goal of
such an automatic speech recognition device is to type out the appropriate
string of phonemic symbols (or perhaps standard orthographic symbols) in
response to any speech input. In the simplest case, the only information
available to the device will be the acoustic waveform itself. A human listener,
of course, can usually take advantage of other sources of information, including
both the linguistic and the situational context of the utterance.2 While
acknowledging the importance of context, we should not overlook the fact that
listeners can identify arbitrarily chosen words and nonsense syllables with high
accuracy when listening conditions are favorable. In other words, we are not
posing an unrealistic problem for our hypothetical device; human listeners can
do remarkably well when little contextual information is available.

Many attempts have been made in recent decades to design a voice-operated
typewriter, but the problem has so far proved elusive. Despite a degree of
success with severely restricted vocabularies when words are spoken by a trained
talker, a generally useful speech recognizer continues to be unattainable. As
Hyde (1972:399) notes, "there are still no devices which can perform even
moderately well on normal (conversational) speech in normal (noisy) environments
by a normal range of talkers."


It is worth considering for a moment how the operation of a speech
recognition system has typically been conceived. As in many other automatic
pattern recognition devices, the procedure involves two stages. In the first
stage, the basic units or segments are located. For example, in automatic reading
of print, this would correspond to isolating individual letters. In the second
stage, each segment is identified as an instance of one of a fixed set of
objects. In the case of print reading, this would correspond to identifying a
segment as a particular letter of the alphabet. Thus a successful
voice-operated typewriter would have to be able to perform two operations on the
acoustic waveform of any speech signal. First, it would have to divide that
waveform into acoustic segments that have a one-to-one correspondence through
time with the sequence of phonemes in the utterance. Second, it would have to
detect the presence or absence of acoustic features that are critical for
identifying particular phonemes. This second stage is often conceived of as
requiring a set of filters, each filter being tuned to a critical acoustic
property (defined along dimensions such as frequency, intensity, and duration).
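
The two-stage strategy just described can be made concrete with a toy sketch.
Everything in it is an assumption made for illustration: frames are pairs of
formant-like values, segments are simply cut at a fixed length, and the
"filters" are replaced by a nearest-template comparison. The template values
are invented, and the sketch is not a description of any actual recognizer; it
shows only the kind of design that presupposes clearly bounded segments and
fixed acoustic definitions.

```python
def segment(frames, segment_length):
    """Stage 1: cut the frame sequence into fixed-length segments."""
    return [frames[i:i + segment_length]
            for i in range(0, len(frames), segment_length)]


def classify(segment_frames, templates):
    """Stage 2: label a segment by the template nearest to its mean frame."""
    n = len(segment_frames)
    mean = tuple(sum(frame[k] for frame in segment_frames) / n for k in range(2))

    def squared_distance(label):
        return sum((m - t) ** 2 for m, t in zip(mean, templates[label]))

    return min(templates, key=squared_distance)


def recognize(frames, templates, segment_length):
    """Run both stages and return one label per segment."""
    return [classify(seg, templates) for seg in segment(frames, segment_length)]


# Invented (first formant, second formant) templates, for illustration only.
HYPOTHETICAL_TEMPLATES = {"i": (300.0, 2300.0), "a": (750.0, 1200.0)}
hypothetical_frames = [(310.0, 2250.0), (290.0, 2350.0),
                       (740.0, 1180.0), (760.0, 1220.0)]
labels = recognize(hypothetical_frames, HYPOTHETICAL_TEMPLATES, segment_length=2)
print(labels)  # ['i', 'a']
```

The sketch works on this tidy input only because the segment boundaries and
the templates were chosen to fit it, which is exactly the one-to-one
correspondence that the strategy takes for granted.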

This strategy implies certain widespread assumptions that only in recent
years have been successfully challenged. For example, we find that in the



2

The importance of context to human listeners has been elegantly demonstrated by
Miller, Heise, and Lichten (1951), who found a remarkably predictable
relationship between the amount of acoustic distortion that yields a given level
of intelligibility and the informational redundancy of the message. For a given
signal-to-noise ratio, intelligibility was greater for words heard in sentence
context than for words heard in isolation.

standard accounts of speech acquisition, it is tacitly assumed that speech
consists of a collection of elementary sounds "transparent" to the infant, such
that he automatically recognizes a parent's utterance of /d/ as "the same sound"
as his own utterance of /d/ (Allport, 1924; Watson, 1924). Similarly, taxonomic
linguists working in the tradition of Bloomfield (1933) supposed that all
languages sampled from a common inventory of sounds (phones). Working from
phonetic transcription of a large number of utterances as a base, these linguists
developed highly successful procedures for determining which sound contrasts
played a role in any particular language. It was believed that the great
practical success of transcription as a tool for language description rested on a
narrow physical base, and that, in principle, an acoustic definition could be
given for each phone. In this view, speech was conceived as a kind of sound
alphabet, in which each phone is conveyed by a discrete package of sound with a
characteristic spectral composition. The pervasiveness of this assumption has
been noted by Denes (1963:892):


The basic premise of [most speech-recognition] work has always been
that a one-to-one relationship existed between the acoustic event and
the phoneme. Although it was recognized that the sound waves
associated with the same phoneme would change according to circumstances,
there was a deep-seated belief that if only the right way of
examining the acoustic signal was found, then the much sought-after
one-to-one relationship would come to light.

In fact, the perceptual skills that underlie phonetic transcription have
never been explained well enough that an algorithm could be written to permit a
machine to do the job. From our present perspective it is clear why no one has
been able to develop a voice-operated typewriter based on the strategy outlined
above. First, there are no clearly bounded segments in the acoustic waveform of
roughly phonemic size; that is, there are no acoustic units available for setting
up a correspondence with phonemes. Second, even if boundaries are arbitrarily
imposed on the continuous signal, the segments corresponding to a particular
phoneme often vary considerably in their acoustic composition. Moreover, any
one of those acoustic segments, transferred to a different phonemic environment,
might be heard as a different phoneme altogether. Not only do the physical
attributes specifying a particular phoneme vary markedly, but the same physical
attribute can specify different phonemes depending on the context.

THE CONTINUOUS SIGNAL DOES NOT REVEAL THE SEGMENTATION OF THE PHONEMIC MESSAGE 

The conclusions stated above are the results of three lines of
investigation begun in the mid 1940s and continuing to the present. We turn now
to review briefly the nature of the evidence.

Of special importance from our standpoint are a series of tape-cutting and
tape-splicing experiments that had the effect of shaking the general confidence
that phonemes are conveyed by isolable bits of sound. These experiments failed
to find any way to divide the signal on the time axis to yield segments of
phoneme size. For example, given a consonant-vowel syllable such as [illegible],
there is no way to cut the piece of magnetic tape so as to produce the consonant
/g/ alone. Some vowel quality always remains. Moreover, if a consonant-vowel
syllable is cut at some point, the consonantal portion may not be heard as the
same phoneme when spliced to a recording of a different vowel. Schatz (1954), for
example, found that the consonantal portion of an utterance of /pi/ was heard
as /k/ when it was joined to the vowel /a/. Harris (1953) and Peterson, Wang,
and Sivertsen (1958) independently concluded that assembled speech made by
splicing together prerecorded segments is not generally intelligible when the
units are smaller than roughly a half-syllable.3 Some investigators (e.g.,
Cole and Scott, 1974a, 1974b) continue to argue that speech perception may be
based in large part on the detection of acoustic invariants for phonemes. They
have claimed that a one-to-one correspondence can often be found, and that
spliced segments can preserve their identity when transferred to new phonemic
contexts. However, much of this apparent "invariance" disappears when one is
careful to cut the initial segment sufficiently short that no trace of the
subsequent vowel remains (cf. Kuhl, 1974).
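
Schatz's splicing result can be restated as a small lookup to show why a
cue-to-phoneme table fails. The sketch is illustrative only: the dictionary
holds the one percept reported in the text (the consonantal portion of /pi/
heard as /k/ before /a/) together with the assumption that the same portion is
heard as /p/ in its original context; the names are hypothetical.

```python
# (acoustic cue, following vowel) -> phoneme heard.
# The first entry assumes the original utterance /pi/ is heard as intended;
# the second is the percept the text attributes to Schatz (1954).
PERCEPTS = {
    ("consonantal portion of /pi/", "i"): "p",
    ("consonantal portion of /pi/", "a"): "k",
}


def phonemes_for_cue(cue, percepts):
    """Collect every phoneme that a given acoustic cue is heard as."""
    return {phoneme for (c, _vowel), phoneme in percepts.items() if c == cue}


def is_one_to_one(cue, percepts):
    """True only if the cue maps to a single phoneme in every context."""
    return len(phonemes_for_cue(cue, percepts)) == 1


heard = phonemes_for_cue("consonantal portion of /pi/", PERCEPTS)
print(sorted(heard))  # ['k', 'p']
print(is_one_to_one("consonantal portion of /pi/", PERCEPTS))  # False
```

Because the percept depends on the pair (cue, context) rather than on the cue
alone, no table keyed only by acoustic segment can reproduce it.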

A second important development is the study of spectrographic displays of 
speech. This work was made possible b^ the Invention of the sound spectrograph 
(Koenig, Dunn, and Lacey, 1946) during World War II and by the general avail- 
ability of such devices for research during the postwar years. The specti:ograph 
displays, in graphical form, the time variations of the spectrum of the speech 
wave. This representation of the sound patterns of speech is valuable for the 
information it gives about articulation. The energy in speech sounds is concen-
trated in a small number of frequency regions that appear on a spectrogram as
horizontal bands (called "formants"). The location of the formants on the fre-
quency scale reflects the primary resonances of a talker's vocal tract (Fant,
1960). Since the shape of the vocal cavity changes at the joining of successive
consonants and vowels, the formant frequencies may be seen to modulate up and 
down as one scans a spectrogram along the time axis. However, efforts to locate 
discrete information-bearing units along the time axis have met with repeated
failure. The phonemic and syllabic segments, which are so clear perceptually,
have no obvious correlates in a spectrogram, as evidenced by the fact that
spectrograms are very difficult to read, even after much experience (Fant, 1962;
but see also Kuhn and McGuire, 1975).⁴

Figure 1 shows spectrograms of two syllables, bib and bub. Note that the 
formant frequencies are nonoverlapping for the entire duration of the sylla-
bles, not just in the middle portion. Although the syllables differ phone-
mically (i.e., perceptually) in only one segment (the medial vowel), acousti-
cally they differ throughout.

Failure to find obvious acoustic cues in spectrograms led to a third line 
of investigation: a variety of experiments with synthetic speech, produced by 
devices that place acoustic parameters under the experimenter's direct control. 
Early work by researchers at Haskins Laboratories made use of hand-painted
patterns resembling spectrograms that were converted into sound by a photoelec-
tric device, the Pattern Playback (Cooper, Liberman, and Borst, 1951). The
Pattern Playback and subsequent computer-controlled electronic synthesizers have
made it possible to do analytic studies in which one parameter is varied at a
time, to determine which parameters were critical for particular phonemes. Only
through systematic psychophysical experimentation of this sort has it been pos-
sible to locate the linguistically relevant information in the speech spectrum
(Cooper, Delattre, Liberman, Borst, and Gerstman, 1952; Liberman, 1957;
Liberman, Harris, Hoffman, and Griffith, 1957; Liberman, Cooper, Shankweiler,
and Studdert-Kennedy, 1967). A major conclusion of this research is that, in
general, there is no simple one-to-one correspondence between perceptual units
and the acoustic structure of the signal. To be successful, synthetic speech
must encode the information for phonemes into acoustic patterns at least a half-
syllable or full-syllable in length.

³It does not follow from this result that the minimal perceptual unit is larger
than the phoneme. The inference we would draw is that decisions about phoneme
identity are made with regard to information distributed over the whole sylla-
ble (and sometimes, perhaps, over a number of syllables).

⁴In remarking on the difficulty of reading spectrograms our point is not that
the spectrogram does not represent the relevant phonemic information, but
rather that the ear has readier access to the brain's phonemic decoder than the
eye.

These findings make it possible to understand why the design of a voice- 
operated typewriter proved so difficult. Phonemes are not merely joined acous- 
tically; they overlap so that two or more are represented simultaneously on the 
same stretch of sound. Conversely, segmentation is impossible because informa-
tion for one phoneme is usually spread over wide stretches of the signal. Even
if segmentation were attempted, the cues isolated would be radically different 
in another phonemic environment. As we saw in Figure 1, all four segments cor- 
responding to /b/ would differ markedly in slope and formant frequency range. 

In sum, the radically context-dependent structure of speech dooms to fail- 
ure the kind of pattern recognition procedure outlined above. A procedure that 
combines a prior stage of segmentation with an analysis of segments by a set of 
tuned-filter "detectors," operating independently and in parallel, will be in-
sufficient as a job description of an automatic recognition machine (and insuf-
ficient as a model of human speech perception as well). One contemporary
approach to the recognition problem (Mermelstein, 1974) is explicit on this 
point, acknowledging that a speech recognizer would have to extract information 
about component phonemes over longer stretches of the signal than a syllable and 
would have to incorporate rules about how that information is distributed. The 
system described by Mermelstein does not assume that the segmentation and label-
ing problems are independent.

SYLLABLE NUCLEI AS TARGETS 

A commonly suggested strategy for speech recognition (Fant, 1970) is to 
classify first those phonemes that are most "transparent" in the signal, and
then to use that information as a basis for determining what the more contextu-
ally variable phonemes are. For example, the research described earlier found
that the acoustic form of many consonants is heavily dependent on the coarticu-
lated vowel. If the vowel were easily detected in speech signals, then it
could be identified first and used as a basis for disambiguating the neighboring 
consonant. 

There are several reasons for supposing that vowels could be extracted 
readily by a routine based on a filter bank, though it would not be possible, in
general, for consonants. In productions of sustained vowels, the positions of
the formants on the frequency axis serve roughly to distinguish the vowels of a
particular talker. For a high front vowel such as /i/, the first and second
formants are spaced far apart (on the frequency axis) while the second and third
formants are spaced close together. For a high back vowel /u/, the pattern is 
just reversed. These patterns can be synthesized with a combination of steady- 
state resonances that simulate the formants found in spectrograms of natural 
vowels. Synthetic stimuli are readily labeled as vowels by listeners, and it is 
possible to generate the full complement of English vowels with two or three 
formants (Delattre, Liberman, Cooper, and Gerstman, 1952). 

This acoustic characterization of vowels as steady-state entities is rein-
forced by articulatory considerations. The flow of speech is marked by a rhyth-
mic pattern of syllables; each syllable contains a vowel "nucleus" that is us-
ually coarticulated with one or more consonants. It is usual to think of conso-
nants as the dynamic component of speech, since they are generally produced by
movement of the articulators, and to regard vowels as the static component, 
since they may be produced with a stationary vocal-tract configuration and sus- 
tained indefinitely. 

This contrast is emphasized by the concept of an idealized vowel as a pro- 
longed, static entity defined (acoustically) by the frequencies of the first two 
or three formants, i.e., by the primary resonances of the stationary vocal 
tract. At least for individual talkers, then, there should be a distinctive set 
of frequency values associated with each vowel, in contrast to the variable 
values associated with the talker's consonants. In that case, vowels should be 
retrievable by a simple two-stage recognition procedure: it would be a
straightforward matter to detect the presence of steady states electronically
(thereby isolating vowel segments for analysis), and a filter bank could then
determine which set of critical frequencies the vowel sound best fits.
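
The two-stage procedure just described can be made concrete with a minimal sketch. Everything in it is an assumption introduced for illustration: the formant track is taken to be a list of (F1, F2) pairs in Hz, one per analysis frame; the thresholds `max_change_hz` and `min_frames` are arbitrary; the target values in `VOWEL_TARGETS` are hypothetical figures for a single talker; and a nearest-target rule stands in for the filter bank.

```python
import math

# Hypothetical F1/F2 targets in Hz for one talker (illustrative values only).
VOWEL_TARGETS = {"i": (270.0, 2290.0), "a": (730.0, 1090.0), "u": (300.0, 870.0)}


def find_steady_state(track, max_change_hz=30.0, min_frames=3):
    """Stage 1: return (start, end) of the longest run of frames in which both
    formants move less than max_change_hz between neighbouring frames.
    Returns None when no run is at least min_frames long."""
    best = None
    start = 0
    for i in range(1, len(track) + 1):
        stable = i < len(track) and all(
            abs(track[i][k] - track[i - 1][k]) < max_change_hz for k in (0, 1)
        )
        if not stable:
            length = i - start
            if length >= min_frames and (best is None or length > best[1] - best[0]):
                best = (start, i)
            start = i
    return best


def classify_vowel(track):
    """Stage 2: average the formants over the steady state and pick the
    nearest target. Returns None if stage 1 found no steady state."""
    span = find_steady_state(track)
    if span is None:
        return None
    frames = track[span[0]:span[1]]
    f1 = sum(f[0] for f in frames) / len(frames)
    f2 = sum(f[1] for f in frames) / len(frames)
    return min(VOWEL_TARGETS, key=lambda v: math.dist((f1, f2), VOWEL_TARGETS[v]))
```

Note that the sketch has nothing to classify when the formants never settle: stage 2 depends entirely on stage 1 finding a steady state.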

Unfortunately for this approach, the apparent simplicity of vowels is 
largely an illusion. Vowels in natural continuous speech, unlike the artific-
ially prolonged vowels of the phonetics laboratory, are not generally specified
by steady states at all. Let us see why this is so. 

First, as a result of coarticulation, vowels are encoded into the structure 
of a full syllable. The imprint of the vowel is not localized but is smeared 
throughout the entire temporal course of the syllable. Thus, information about 
a vowel is available in the transitions as well as in the steady-state portion 
(if, indeed, a steady state is even attained). This was clear in the earlier 
example of the syllables bib and bub; in both cases, the vowel affected the
spectral pattern of the entire syllable. Moreover, the acoustic properties of
the vowel nucleus may be affected by coarticulated consonants. Measurements by
House and Fairbanks (1953) and Stevens and House (1963), for example, indicated
consistent changes in the duration, fundamental frequency, and formant frequen-
cies and intensities of vowels, depending on consonantal context. It should
also be noted that consonants can affect the structure of neighboring vowels if 
(by the phonological rules of the dialect) a distinction between two consonants 
is actually manifested by a difference in the neighboring vowels. For example, 
the /d/ in rider is distinguished from the /t/ in writer by the increased dura- 
tion of the vowel that precedes it. Similarly, in some dialects of English, a 
nasal phoneme, such as the /n/ in pants, is realized by a nasalization of the
preceding vowel; thus the spectral structure of the vowel /æ/ will vary marked-
ly depending on whether pats or pants is spoken. Because the coarticulation
effects between consonants and vowels do not operate in one direction alone, but
are two-way effects, there is no obvious acoustic invariant that characterizes
a vowel in all consonantal contexts.

A second major source of variance in the acoustic structure of vowels is 
the tempo of articulation. During rapid rates of speech, steady-state configur- 
ations may never be attained at all. Acoustic analysis of rapid speech supports 
the hypothesis of articulatory "undershoot," since syllable nuclei often do not 
reach the steady-state formant frequency values characteristic of vowels in 
slowly articulated syllables (Lindblom, 1963; Stevens and House, 1963). 
Lindblom and Studdert-Kennedy (1967) found that listeners showed a shift in the
acoustic criteria that they adopted for vowels (i.e., there was a shift in the
phoneme boundary between them) as a function of perceived rate of utterance.
Apparently, human listeners compensated for this simulated articulatory under-
shoot by perceptual overshoot. These data show that formant transitions, which
are generally understood to carry consonantal information, may also aid in speci-
fying the vowel. Thus, in ordinary speech, vowels, like consonants, are dynamic
entities that are scaled by the pace of speaking.

A third source of acoustic variation in vowels is associated with the in-
dividual characteristics of the talker. We perceive this variation directly
when we identify persons on the basis of voice quality. On the other hand, such
individual variation is irrelevant and becomes "noise" when our intent is to
recover the linguistic message. Inasmuch as formant frequencies reflect vocal-
tract dimensions, it is obvious that the absolute positions of the formants will
not be the same for a child as they are for an adult. The extent of the problem
is suggested by Joos (1948:64). 

The acoustic discrepancies which an adult has to adjust for when 
listening to a child speaker are nothing short of enormous — they 
commonly are as much as seven semitones or a frequency ratio of 3 
to 2, about the distance from /e/ to /u/.

Somehow, in spite of this, we manage to understand small children's speech rea-
sonably well and they ours. This is especially remarkable given that the dif-
ference between children and adults cannot be described by a simple scale fac-
tor. The vocal tract not only increases in size but changes in shape, and the
consequent changes in the acoustic output are correspondingly complicated. In-
deed, Fant (1966) has argued that the assumption of an invariant relation be-
tween formants one and two for a given vowel is just as untenable as the assump-
tion that the absolute formant frequencies of the vowel are invariant for all
talkers. Thus, the relation between utterances of a syllable by an adult and a
small child is not the multiplicative relation that obtains between versions of
a melody played in different keys.
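
The arithmetic behind Joos's figure, and the sense in which transposition is "multiplicative," can be checked with a short sketch. It assumes equal-tempered semitones (twelve to the octave), which Joos does not specify; the function names, the 5 percent tolerance, and any formant values passed in are hypothetical.

```python
import math


def semitones_to_ratio(n):
    """Frequency ratio spanned by n equal-tempered semitones (assumed tuning)."""
    return 2.0 ** (n / 12.0)


def ratio_to_semitones(r):
    """Inverse of semitones_to_ratio."""
    return 12.0 * math.log2(r)


def is_uniform_scaling(adult_hz, child_hz, tolerance=0.05):
    """True if every child formant is (nearly) the same multiple of the
    corresponding adult formant, as it would be under simple transposition."""
    ratios = [c / a for a, c in zip(adult_hz, child_hz)]
    return max(ratios) / min(ratios) - 1.0 <= tolerance
```

Seven semitones come to a ratio of about 1.498, which is the 3-to-2 ratio Joos cites. A melody moved to another key would pass `is_uniform_scaling`; a pair of formant sets whose ratios differ from one formant to the next would not, which is the situation the text describes for adult and child talkers.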

Having discussed variation based on physical differences in the sound-pro- 
duction apparatus, we should also mention differences that are social in origin, 
reflecting local variations of dialect within the larger language community. 
These variations are associated, of course, with geographical region, ethnic
group, and socioeconomic class. Additional sources of talker-related variation
are idiosyncratic speech mannerisms, emotional state, and fatigue. These
sources, in addition to those we have discussed above, pose enormous difficul-
ties to the design of an automatic recognition device.⁵

No one, to our knowledge, has seriously considered how an automatic speech 
recognition routine would adjust its criteria to compensate for the variations
associated with coarticulation, tempo, and talker. When we consider the magni-
tude and variety of variations that we take in our stride as perceivers, we be-
gin to realize something of the complexity of the relations between the signal
and the phonemic message. The difficulties encountered by the task of machine 
recognition command a new respect for the subtlety and versatility of the human 
perceptual apparatus and lead us to a new appreciation of the abstract nature of 
speech perception. 

WHAT SPECIFIES A VOWEL? 

The idea that vowels can be defined as fixed sets of steady-state values is
an oversimplification that bears little relation to the structure of natural 
speech. We have found it necessary to reopen the question of what specifies a 
vowel, and we wish to introduce some recent findings as a case study in the 
problem of perceptual constancy. 

Vowels, as we noted earlier, are traditionally defined by formants. Vowel 
quality is associated with concentrations of acoustic energy in a few relatively 
narrow portions of the frequency spectrum; energy in the regions between these 
bands is generally weak and has little perceptible effect on vowel quality. In 
distinguishing among vowels, the lowest two formants are traditionally thought 
to be the most significant; the contribution to perception of the third and 
higher formants is problematical. For this reason, vowels are customarily rep- 
resented as points located in a two-dimensional space defined by the first and 
second formants. As a result of variations among talkers, the points in this 
acoustic vowel space are actually regions. A critical question for perceptual 
theory is: How much or how little do these regions overlap? 

A thorough assessment of this question was made by Peterson and his col-
leagues (Peterson, 1951; Peterson and Barney, 1952), who obtained spectrographic
measurements of tokens of 10 American English vowels produced by 76 talkers (in-
cluding men, women, and children). Figure 2, which is redrawn from Peterson and
Barney (1952), shows the vowel space defined by measurements of formants one and
two (F1 and F2). We note that there is considerable overlap in some regions.
In running speech we might expect a comparable analysis to show still more over-
lap. The findings showed not only lack of invariance in the position of the
formants in children and adults, but also considerable average differences be-
tween men and women and considerable variation among talkers of the same age
group and sex.

Figure 2: First- and second-formant frequencies of American-English vowels for
          a sample of 76 adult men, adult women, and children. The closed
          loops enclose 90 percent of the data points for each vowel category
          (redrawn from Peterson and Barney, 1952).

⁵Reflect, too, on the variety of transformations of the signal that might be
produced by the commonplace feats of talking with food in the mouth, with a
cigar between the lips, or with teeth firmly clamped on a pencil (cf. Nooteboom
and Slis, 1970).

In his pioneering monograph on acoustic phonetics, Joos (1948) had discussed
the dilemma that such variation poses for theories of speech perception. If
different spectra are heard by listeners as the same vowel, on what does the
judgment of sameness depend? It cannot, he concludes, be due to any evidence in
the sound:

Therefore the identification is based on outside evidence. ... If this 
(outside evidence) were merely the memory of what the same phoneme 
sounded like a little earlier in the conversation, the task of inter-
preting rapid speech would presumably be vastly more difficult than
it is. What seems to happen, rather, is this. On first meeting a
person, the listener hears a few vowel phones and on the basis of 
this small but apparently sufficient evidence he swiftly constructs a 
fairly complete vowel pattern to serve as background (coordinate sys- 
tem) upon which he correctly locates new phones as fast as he hears 
them. ... (p. 61).

Thus, in Joos's view, the listener calibrates the talker's vowel space on the 
basis of a small subset of sample utterances. The listener needs some refer- 
ence points to define the range and distribution of the talker's vowels. These 
reference signals, Joos suggests, could be supplied by extreme articulations (in 
terms of tongue height and point of tongue contact). Thus, Joos (1948) leaves 
no doubt that the coordinate system he has in mind is based at least in part on 
a model of the vocal tract: 

The process of correctly identifying heard vowel colors doubtless in 
some way involves articulation. A person who is listening to the 
sounds of a language that he can speak is not restricted to merely 
acoustic evidence for the interpretation of what he hears, but can 
and probably does profit from everything he "knows," including of 
course his own way of articulating the same phones. 

Since the publication of Joos's work, a normalization step in speech per- 
ception has been assumed by virtually everyone who has written on the subject, 
whether or not the writer accepted Joos's version of the motor theory of speech 
perception (or, indeed, any other version of motor theory). The idea that a
listener makes use of reference vowels for calibration of a talker's vowel space
has also persisted. Gerstman (1968) and Lieberman (1973), whose contributions
we shall discuss presently, each have taken up the reference vowel idea and de-
fended it. What is surprising, given this overwhelming consensus, is how few
attempts have been made to measure the ambiguity in perception of vowels that is
directly attributable to talker variation.

Joos's (1948) statement of the constancy problem implies that, for an un-
known talker, isolated syllables should be highly ambiguous from the standpoint
of the perceiver. It is perplexing, then, that two experiments that directly
measured the perceptual ambiguity of natural speech found little support for
this prediction. Peterson and Barney (1952), to whom we are indebted for sys-
tematic acoustic measurements of individual differences in vowel formants, also
attempted to assess the perceptual consequences of the variation they discovered
in production. The same recorded utterances used in making the spectrographic
measurements were also assembled into listening tests. Listeners had to identi-
fy tokens of 10 vowels in /h-d/ consonantal environment; the set consisted of
heed, hid, head, had, hod, hawed, hood, who'd, hud, and heard. Syllables pro-
duced by groups of 10 talkers (men, women, and children) were randomly mixed on
each listening test, ensuring that opportunities for normalization would be
slight. Although plots of the first two formants show that the regions occupied
by these vowels overlap considerably, perceptual judgments were remarkably
accurate, with 94 percent of the words perceived correctly. A similarly low
error rate for perception of a larger set of vowels, which included diphthongs,
was reported by Abramson and Cooper (1959).

These perceptual data do not support the notion that a single syllable, 
spoken in isolation, is necessarily ambiguous. Apparently, the information con-
tained within a single syllable is usually sufficient to allow whatever adjust-
ment for individual talker characteristics might be required. The history of
the research following Peterson and Barney's (1952) study shows that this con-
clusion was not generally drawn.

The work of Ladefoged and Broadbent (1957) and Ladefoged (1967) is widely 
cited as evidence that the listener has relative criteria for vowel identifica- 
tion and that the identity of a vowel depends on the relationship between the 
formant frequencies for that vowel and the formant frequencies of other vowels
produced by the same talker. As a result, vowel space must presumably be re- 
scaled for each voice a listener encounters. The Ladefoged and Broadbent (1957) 
study was designed to find out whether subjects could be influenced in their 
identifications of a test word by variations in an introductory sentence preced- 
ing it. Synthetic speech was used in order to gain precise control over the 
acoustic parameters. A set of test syllables of the form /b-vowel-t/ was pre-
pared on the synthesizer. Listening tests were made up in which the test words
were presented following a standard sentence: Please say what this word is.
Variants of this sentence were produced by shifting the frequencies of the first
or second formants up or down. Each was intelligible despite wide acoustic dif-
ferences. The results showed that the same test word was identified as bit when
preceded by one version of the test sentence (i.e., one "voice") and as bet when 
preceded by a second version (another "voice") in which the first formant varied 
over a lower range. The authors conclude from perceptual shifts such as this 
that the identification of a vowel is determined by its relation to a prior 
sample of the talker's speech (provided here by the test sentence).

Ladefoged (1967) interprets these findings within the framework of Helson's
(1948) adaptation level theory. This theory attempts to account for the extra-
ordinary efficiency of the compensatory mechanisms that achieve constancies,
such as color constancy under changing illumination, by supposing that the per-
ceiver scales his responses not to the absolute properties of each stimulus, but
according to the weighted mean of a set of stimuli distributed over time. The
introductory sentence, in the Ladefoged and Broadbent experiment (1957), is
understood as providing the standard or anchor, thus creating an internal adap-
tation level to which the test words are referred. We shall return to the adap-
tation level hypothesis presently, after we have introduced some relevant find-
ings of our own.
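
The logic of an adaptation-level account of the bit/bet shift can be sketched as a toy model. This is not Helson's formulation or Ladefoged's analysis. It assumes that only F1 matters, that the adaptation level is a plain weighted arithmetic mean of the F1 values heard in the introductory sentence, and that a test vowel whose F1 lies above that level is heard as bet and otherwise as bit. All names and numbers are hypothetical.

```python
def adaptation_level(values, weights=None):
    """Weighted mean of the context stimuli (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def label_relative_to_context(test_f1_hz, context_f1_hz, weights=None):
    """Toy decision rule: judge the test vowel's F1 against the adaptation
    level set by the preceding sentence, not against a fixed criterion."""
    level = adaptation_level(context_f1_hz, weights)
    return "bet" if test_f1_hz > level else "bit"
```

Under these assumptions the same test token changes label with the context: after a sentence whose first formant varies over a lower range the token's F1 is relatively high and the rule returns bet, while after a sentence with a higher F1 range it returns bit, which is the direction of the shift reported above.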

If it is true that a listener needs a sample of speech in order to fix the
coordinates of a talker's vowel system, we need to know how large a sample is
required and whether particular vowels are more effective than others as
"anchors." As we noted earlier, Joos (1948) believed that the best reference
signal would be one that allows the listener to determine the major dimensions
of the talker's vocal tract. He therefore suggested that the "point vowels"
/i/, /a/, and /u/ might be the primary calibrators of vowel space, since they
are the vowels associated with the extremes of articulation. Lieberman and
colleagues (Lieberman, Crelin, and Klatt, 1972; Lieberman, 1973) agree that
these vowels probably play an important role in disambiguating syllables pro-
duced by a novel talker. They note that the point vowels are exceptional in
several ways: they represent extremes in acoustic and articulatory vowel space,
they are acoustically stable for small changes in articulation (Stevens, 1972),
and they are the only vowels in which an acoustic pattern can be related to a
unique vocal-tract area function (Lindblom and Sundberg, 1969; Stevens, 1972).

Gerstman (1968) has made one of the few direct attempts to test the idea
that a subset of vowels can serve to calibrate a talker's vowel space and reduce
errors in recognition of subsequently occurring vowels. Gerstman developed a
computer algorithm that correctly classified an average of 97 percent of the
vowels in the syllables produced by the Peterson and Barney panel of 76 talkers.
For each talker's set of 10 utterances, the program rescaled the first- and
second-formant frequency values of each medial vowel, taking the extreme values
in the set as the endpoints. Since these extreme formant values are typically
associated with /i/, /a/, and /u/, the procedure corresponds to a normalization
of each talker's vowel space with reference to his own utterances of the point
vowels. The classification system was essentially a filter bank that classified
the vowels according to the scaled values for the first two formants, and the
sums and differences of these values. By inserting the normalization stage be-
tween segmentation and classification, Gerstman's program succeeded in reducing
by half the errors of classification made by human perceivers [recall that
Peterson and Barney's (1952) listeners made 6 percent errors]. We must keep in
mind, however, that a successful algorithm is not a perceptual strategy, but
only a possible strategy. Although it is of interest that such an algorithm
can, in principle, serve as the basis for categorization of the signals, we are
aware of no evidence that the human perceptual apparatus functions analogously.
For example, it would be necessary to demonstrate that humans scale individual
formants and calculate sums and differences between them.
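
The rescaling step described above can be sketched as a per-talker range normalization. This is a reconstruction from the description in the text, not Gerstman's program. The 0-to-1 output range, the function names, the order of the derived features, and the direction of the difference (scaled F2 minus scaled F1) are all assumptions, and the classifier itself is not reproduced.

```python
def rescale_talker(formants):
    """Rescale one talker's F1 and F2 values so that the lowest value of each
    formant in the talker's own set maps to 0.0 and the highest to 1.0.

    formants: dict mapping a vowel label to an (F1, F2) pair in Hz.
    """
    f1_values = [f1 for f1, _ in formants.values()]
    f2_values = [f2 for _, f2 in formants.values()]
    lo1, hi1 = min(f1_values), max(f1_values)
    lo2, hi2 = min(f2_values), max(f2_values)
    if hi1 == lo1 or hi2 == lo2:
        raise ValueError("need at least two distinct values per formant")
    return {
        vowel: ((f1 - lo1) / (hi1 - lo1), (f2 - lo2) / (hi2 - lo2))
        for vowel, (f1, f2) in formants.items()
    }


def classifier_features(scaled_f1, scaled_f2):
    """Scaled values plus their sum and difference (assumed feature layout)."""
    return (scaled_f1, scaled_f2, scaled_f1 + scaled_f2, scaled_f2 - scaled_f1)
```

Because the endpoints are taken from each talker's own extremes, two talkers whose absolute formant frequencies differ widely can end up with nearly the same scaled coordinates for a given vowel. The cost is that the whole set of a talker's vowels must be in hand before any single vowel can be scaled.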

It seemed to us that speculation had far outstripped the data bearing on 
the vowel constancy problem. In fact, the few studies of perceivers' recogni-
tion of natural (as distinguished from synthetic) speech indicated that isolated
syllables spoken by novel talkers are remarkably intelligible. Therefore, it
seemed important to verify Ladefoged and Broadbent's (1957) demonstration with
natural speech in which all the potential sources of information that are ordin-
arily available to perceivers are present in the signal. Similarly, Gerstman's
(1968) success in machine recognition using the three point vowels as calibrat-
ing signals needed to be evaluated against the performance of human listeners.
Accordingly, we designed experiments to determine the size of the perceptual
problem posed for the listener when the speaker is unknown. This involved a
comparison of perceptual errors, under matched conditions, when the test words
were spoken by many talkers and when all were produced by only one talker. We
also sought to discover whether certain vowels (e.g., the point vowels) have a
special role in specifying the coordinates of a given talker's vowel space.

HOW DOES A LISTENER MAP A TALKER'S VOWEL SPACE?
AN EXPERIMENT TO DETERMINE THE SIZE OF THE PROBLEM 

We (Verbrugge, Strange, and Shankweiler, 1974) first attempted to measure
the degree of ambiguity in vowel perception attributable to lack of congruence
of the vowel spaces of different talkers. To measure this, we presented listen-
ers with unrelated words or nonsense syllables, so that broad linguistic context
would make no contribution to the act of identification. Our studies of this
problem were fashioned after Peterson and Barney (1952). We presented nine
vowels in the consonantal environment /p-p/; thus, the set consisted of /pip/,
/pɪp/, /pɛp/, /pæp/, /pɑp/, /pɔp/, /pʌp/, /pʊp/, and /pup/. In one listening
test, the listeners heard tokens spoken by 15 different voices (5 men, 5 women,
5 children), arranged in random order. To determine what proportion of percep-
tual error is due to uncertainty about the talker, as opposed to other sources,
a second listening test was employed in which a single talker uttered all the
tokens on a given test. The talkers were given no special training. They were
urged to recite the syllables briskly in order to bring about some undershoot of
steady-state targets. Our objective was to achieve conditions as similar as
possible to normal conversational speech.

Listeners misidentified⁶ an average of 17 percent of the vowels when spoken
by the panel of 15 talkers, and 9 percent when each of three tests was spoken by
a single individual (a representative man, woman, and child from their respective
groups). The difference between these two averages, 8 percent, is a measure of
the error attributable to talker variation. Although this is a statistically
significant difference [t(50 df) = 5.14, p < .01], its absolute magnitude is
surprisingly small. Less than half the total number of errors obtained in the
variable-talker condition can be attributed to talker variation.
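
The "less than half" statement follows directly from the two error rates reported above. The short calculation below only restates that arithmetic; the function name is hypothetical.

```python
def talker_variation_share(mixed_error_pct, single_error_pct):
    """Return the error attributable to talker variation (in percentage points)
    and its share of all errors made in the mixed-talker condition."""
    attributable = mixed_error_pct - single_error_pct
    return attributable, attributable / mixed_error_pct


attributable, share = talker_variation_share(17.0, 9.0)
print(attributable, round(share, 2))
```

With 17 percent errors in the mixed-talker test and 9 percent in the single-talker tests, 8 percentage points are attributable to talker variation, or about 47 percent of the mixed-talker errors.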

Listeners can identify vowels in consonant-vowel-consonant syllables with
considerable accuracy even when they are spoken by an assortment of talkers de-
liberately chosen for vocal diversity. The intended vowel was identified on
83 percent of the tokens in a test designed to maximize ambiguity contributed by
vocal-tract variation.⁷ In a second study, listeners were asked to identify 15
vowels (monophthongs and diphthongs) spoken in /h-d/ context by 30 talkers.
Here, the rate of identification errors was 13 percent overall and 17 percent
for the nine monophthongs alone. The results of both studies are in essential
agreement with earlier perceptual data reported by Peterson and Barney (1952)
and Abramson and Cooper (1959), although the error rates obtained by these in-
vestigators were even lower than those we obtained.⁸

⁶We have taken the intended vowel of the talker as the criterion of correct
identification. That is, we have defined an identification error as a response
by the listener that does not correspond to the phonemic category intended by
the talker. It might be the case that errors so defined are as much due to
mispronunciation as to misperception. No correction was given to talkers dur-
ing recording other than to clarify orthographic confusions in a few instances.
In the case of the youngest children, some coaching was required before they
pronounced the nonsense syllables. However, no adult models were provided
immediately prior to utterances that were included in the tests.

⁷Each of the 15 talkers spoke only three tokens containing different vowels.
These tokens were separated in the test by no fewer than eight intervening
tokens spoken by different talkers. Listeners were unable to judge how many
talkers were included on a test.

These findings do not bear out the common assumption of a critical need for
extended prior exposure to a talker's speech. The information contained within
a single syllable appears to be sufficient in most cases to permit recognition
of the intended vowel; familiarity with a voice seems to play a rather small
role in the identification process.

ARE THE POINT VOWELS USED BY LISTENERS AS AIDS TO NORMALIZATION? 

We (Verbrugge, Strange, and Shankweiler, 1974) next examined the possibil-
ity that an introductory set of syllables increased the likelihood that a suc-
ceeding vowel produced by the same talker was correctly recognized. Because we
wished to test a specific hypothesis about the stimulus information required for
normalization, we did not employ an introductory sentence as Ladefoged and
Broadbent (1957) had done. Instead, we introduced each target syllable by three
precursor syllables; this provided three samples of the talker's vowels and
little else. In one condition of the experiment, the precursors were /hi/,
/ha/, and /hu/; these syllables contain examples of the talker's point vowels.
For a second condition, we chose /hɪ, hæ, hʌ/, a set of nonpoint vowels that
(like the point vowels) are quite widely separated in the space defined by the
first and second formants.

Neither set of precursor syllables brought about a systematic reduction of
perceptual errors in identifying the target vowels. The errors in each
precursor condition averaged 15 percent (compared to 17 percent in the earlier
condition without precursors), but in neither case does this difference approach
significance.⁹ The principal effect of the /hi/, /ha/, /hu/ precursors was to
shift the pattern of responses somewhat, some vowels showing improved
identification, others showing poorer identification.

The idea that normalization is specifically aided by the point vowels, as
suggested by Joos (1948), Gerstman (1968), and Lieberman (1973), is not supported
by these data. In fact, no precursor syllables that we tried were found to have
a systematic effect. A single, isolated syllable is usually sufficient to 



We suspect that these studies made somewhat less severe demands on listeners'
perceptual capacities than our own. In the Peterson and Barney (1952) study,
listeners heard only 10 different talkers on a particular test. Each talker
spoke two tokens of each of 10 /h-d/ syllables. The study yielded an overall
error rate of 6 percent. The Abramson and Cooper (1959) study employed eight
talkers, each of whom spoke one token of 15 /h-d/ syllables. The overall error
rate in that study ranged from 4 to 6 percent. An additional source of
perceptual difficulty in our tests is the fact that /a/ and /ɔ/ are homophonous
in the dialect of most of our talkers.

9 

This result was confirmed in a separate study (cf. Verbrugge, Strange, and
Shankweiler, 1974) of 15 vowels in /h-d/ context. When no precursors were
present, errors averaged 13 percent. When each /h-d/ syllable was preceded by 
the syllables, /kip/, /kap/, /kup/, 12 percent of the responses were errors. 
The difference was not significant. 
specify a vowel; prior exposure to specific subsets of vowels could not be shown
to supply additional information. It would seem unnecessary to invoke a
psychophysical weighting function in order to establish an internal adaptation
level (cf. Ladefoged, 1967). We may surmise that the isolated syllable is not so
ambiguous an entity as is sometimes implied.

Because of the repetitive and stereotyped manner in which the precursors
were presented, some readers might be inclined to doubt whether listeners made
full use of the phonetic information potentially available and therefore to
question whether these experiments are adequate to test the hypothesis. We can
reply to this objection indirectly by referring to a further experiment in
which the same precursor syllables did produce a measurable effect on perception
of a subsequent target syllable. This experiment involved the same 15 talkers
as the earlier experiment, but differed in that the test syllables were produced
in a fixed sentence frame: "The little /p-p/'s chair is red." Each talker was
instructed to produce the sentence rapidly, placing heavy stress on the word
"chair." The unstressed, rapidly articulated /p-p/ syllables were excised from
the tape recording and assembled into two new listening tests. In one
condition, the /p-p/ targets were prefaced by the same tokens of the /hi, ha, hu/
precursors employed in the previous experiment. On this test, listeners made an
average of 29 percent errors in identifying the vowels in the target syllables.
In the other condition, no precursors were present. On this test, listeners
misidentified 24 percent of the same vowels. Thus, misperception of target
vowels occurred with significantly greater frequency when they were preceded by
precursors [t(35 df) = 2.88, p < .01]. We may suppose that the precursors
impaired recognition of succeeding vowels in this instance because they specified
a speaking rate slower than that at which the /p-p/ syllables were actually
produced. Thus, whereas we failed to find evidence for effects of precursors on
normalization of vocal-tract differences, we do find evidence for adjustment to
a talker's tempo (as hypothesized by Lindblom and Studdert-Kennedy, 1967), on
the basis of preceding segments of speech.

THE ROLE OF FORMANT TRANSITIONS IN VOWEL PERCEPTION 

Our results suggest that the identity of a vowel in a syllable spoken by a
new talker is likely to be specified by information within the syllable itself.
The phonetic context supplied by preceding syllables apparently serves a
function other than that of adjustment for a new set of vocal-tract parameters; it
may enable the perceptual system to gauge the tempo of incoming speech and to
set its criteria accordingly. We were encouraged by these preliminary findings
to look for the sources of information that specify the vowel within the
syllable, and to explore how that information is used by the perceiver in the process
of vowel perception.

As we noted earlier, the formant transitions in a syllable vary
systematically as a function of both the consonant and the vowel. Therefore, we might
expect that the listener utilizes information contained in the transitions in
recovering the identity of the medial vowel. Research on the identification of
isolated steady-state vowels (i.e., vowels that are not coarticulated with
consonants) indirectly supports this expectation. Perception of isolated vowels
is notably unreliable. Fairbanks and Grubb (1961) presented nine isolated
vowels produced by seven phonetically trained talkers to eight experienced
listeners. The overall identification rate was only 74 percent; rates for individual
vowels ranged from 53 to 92 percent. Slightly better identification of isolated
vowels was obtained by Lehiste and Meltzer (1973) for three talkers, where,
again, talkers and listeners were phonetically skilled. Fujimura and Ochiai
(1963) directly compared the identifiability of vowels in consonantal context
and in isolation. They found that the center portions of vowels, which had been
gated out of CVC syllables, were less intelligible in isolation than in syllabic
context.

Research bearing on this question has also been done with synthetic speech.
Millar and Ainsworth (1972) reported that synthetically generated vowels were
more reliably identified when embedded in an /h-d/ environment than when
acoustically identical steady-state target values were presented in isolation.
Finally, Lindblom and Studdert-Kennedy (1967) noted that listeners used
different acoustic criteria to distinguish pairs of vowels depending on whether
judgments were made on isolated vowels or on the same vowel targets embedded within
a CVC environment.

There are at least two ways that the transitional portions of the acoustic
signal might provide information for vowel identity. One possibility is that
transitions play a role in specifying talker characteristics. Since the loci of
formant transitions for a particular consonant vary with differences in
vocal-tract dimensions (Fourcin, 1968; Rand, 1971), transitions might serve as
calibration signals for normalization. Particularly when the phonemic identity of
the consonants is fixed and known to the listener, the transitions might serve
to reduce the ambiguity of the vowel by providing information about vocal-tract
characteristics of the talker who produced the syllable.

We may also envision a second possibility that is at once more general and
more parsimonious: the acoustic specification of vowels, like that of consonants,
is carried in the dynamic configuration of the syllable. In other words, the
syllable as a whole cospecifies both consonants and vowel. In this view, transitions
may be regarded as belonging to the vowel, no less than to the consonants. If
this were true, we would expect that the perception of medial vowels would be
aided by the presence of consonantal transitions regardless of whether the
perceiver encounters many talkers on successive tokens or only one.

To make an experimental test of these possibilities, we (Strange, Verbrugge,
and Shankweiler, 1974) constructed a new set of listening tests that contained
a series of isolated vowels. In one condition, the vowels were spoken by the
same panel of 15 talkers described above. In a second condition, a single
talker produced the full series of vowels. Together with the earlier tests with
/p-p/ syllables, these materials allowed us to compare the relative effects on
vowel identifiability of two major variables: presence or absence of
consonantal environment and presence or absence of talker variation within a test. This
also placed us in a position to evaluate the alternative hypotheses about how
consonantal environment contributes to vowel perception.

According to either hypothesis, we would expect that the perception of
isolated vowels would be less accurate than the perception of medial vowels on a
listening test in which the tokens are produced by different talkers. However,
the two alternative hypotheses generate different expectations concerning the
error rate on isolated vowels and medial vowels when the talker does not vary
within a test. If the advantage of consonantal environment is due to use of
transition cues for normalization, we could expect to obtain no difference
between performance on these two conditions, because in neither case is there a
need for repeated calibration. Therefore, we would expect that vowel
recognition would be as accurate for the isolated vowels as for the medial vowels. If,
on the other hand, the consonantal environment provides critical information for
the vowel independent of talker-related variation, we would expect a difference
in consonantal environment to affect performance whether or not talkers vary
within a test. Thus, we would expect identification of isolated vowels to be
less accurate than medial vowels even for tests in which the talker did not vary.

The results for the isolated vowel tests support the latter hypothesis.
The average error in the variable-talker condition was 42 percent (compared to
17 percent errors on the comparable test in which vowels were spoken in /p-p/
environment). This increase in errors is consistent with either hypothesis.
However, the results for the single-talker condition also showed a large increase
in errors when there was no consonantal environment. The average error in the
single-talker condition was 31 percent for the isolated vowels (compared to 9
percent errors on medial vowels). Moreover, a vowel-by-vowel comparison showed
that for every vowel in both talker conditions, the error rate on the isolated
vowel was greater than on the corresponding medial vowel. Both major variables
(Consonants Present versus Absent and Talker Variation Present versus Absent)
were shown to produce significant differences in overall errors [F(1,94 df) =
125.17 and 21.18, respectively, p < .01]. The decrease in accuracy of vowel
recognition due to the absence of consonantal environment was approximately the
same whether talkers varied or not (i.e., the analysis showed no significant
interaction between variables). We may surmise, therefore, that consonantal
transitions do not aid in specifying a vowel by providing information for a
normalization stage. On the contrary, these results indicate that the presence
or absence of transitions is much more critical for accurate recognition than
the degree of experience with a talker's vocal-tract parameters. Whereas the
presence of within-test talker variation impairs recognition by only about 8
percent, the absence of a consonantal environment impairs performance by more
than 20 percent.
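
The arithmetic behind this comparison can be laid out in a short sketch. The
four cell means are the error percentages reported above; the function and
variable names are illustrative only, and averaging the two simple effects of
each factor is an assumption about how one might summarize them, not the
analysis of variance reported in the text.

```python
# Mean error percentages reported in the text for the four test conditions.
# Keys are (consonantal environment, talker condition).
errors = {
    ("isolated", "variable"): 42.0,
    ("medial", "variable"): 17.0,
    ("isolated", "single"): 31.0,
    ("medial", "single"): 9.0,
}


def consonant_cost(talker):
    """Extra errors (percentage points) when the consonantal environment is removed."""
    return errors[("isolated", talker)] - errors[("medial", talker)]


def talker_cost(environment):
    """Extra errors (percentage points) when talkers vary within a test."""
    return errors[(environment, "variable")] - errors[(environment, "single")]


# Average each factor's cost over the two levels of the other factor.
consonant_effect = (consonant_cost("variable") + consonant_cost("single")) / 2
talker_effect = (talker_cost("isolated") + talker_cost("medial")) / 2

# Difference of differences: a small value means the factors act roughly additively.
interaction = consonant_cost("variable") - consonant_cost("single")

print(consonant_effect, talker_effect, interaction)  # 23.5 9.5 3.0
```

On these figures, removing the consonantal environment costs 25 and 22
percentage points in the two talker conditions, while talker variation costs 11
points for isolated vowels and 8 points for medial vowels.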

The possibility cannot be overlooked, however, that the relatively poor
perception of isolated vowels is attributable primarily to the talkers' inabil- 
ity to produce them reliably. Since isolated vowels do not occur in natural 
speech (with a few exceptions), talkers may produce them in peculiar ways, with 
formant frequencies uncharacteristic of the values found in natural syllables. 
Also, the characteristic relative durations of the vowels (Peterson and Lehiste, 
1960) might not be preserved by talkers in their productions of isolated vowels. 

To investigate these possibilities, we undertook spectrographic analysis of
the tokens of isolated vowels and medial vowels used in our listening tests.
Center frequencies of the first three formants and vowel duration were measured
for all the tokens in the variable-talker tests, as well as for tokens of all
nine vowels spoken in isolation by each of the 15 talkers. The data provided no
evidence that the isolated vowels were produced in an aberrant manner. Average
formant frequencies for men, women, and children correspond quite closely to
those reported by Peterson and Barney (1952) (for vowels in /h-d/ environment),
with the exception of /ɔ/.¹⁰ When the formant frequencies of each talker's



This deviation is due to a dialect difference between our group of talkers
(predominantly natives of the upper Midwest) and Peterson and Barney's group.






isolated and medial vowels are compared, the values are found to be highly
similar. Measurements of vowel duration also fail to account for the increased
error rate for isolated vowels. Although the durations of these were for the
most part longer than the vowels in /p-p/ environment, the relative durations of
the nine isolated vowels were much the same as the relative durations of vowels
in consonantal environment. We may suppose, therefore, that the higher error
rate for isolated vowels compared to that for vowels in a fixed consonantal
environment cannot be explained on the grounds that isolated vowels tend to be
produced in an aberrant manner.

The message of these perceptual data is clear: isolated, sustained vowels,
although they correspond well to the phonetician's idealized conception of a
vowel,¹¹ are poorly specified targets from the standpoint of the listener.
Lehiste and Peterson (1959) found that many hours of practice were needed by
untrained listeners before they could identify isolated vowels accurately, even
when the tokens were painstakingly produced by a single phonetically trained
talker. The ability to identify these "ideal" vowels may be a highly specific
skill with little relevance to the identification of vowels in natural speaking
situations.

At this point the objection might still be raised that the tests used to
measure the perceptual difficulty of medial vowels are unrepresentative of
natural conditions. One possibility is that there may be an advantage associated
with consonantal context if the context is known beforehand (/p-p/ in this case),
but that this advantage would be largely eliminated if the identity of
neighboring consonants were unknown (as is often the case in natural speech). To test
this possibility, we constructed a listening test in which the target vowels
were enclosed by a variable consonantal environment. A panel of 12 talkers (a
subset of the original 15) spoke a series of consonant-vowel-consonant syllables.
In each syllable, one of the six stop consonants (/b, d, g, p, t, k/) appeared
before the vowel and one of the six appeared after the vowel; consonants were
selected so that each occurred equally often in each position. One group of
listeners was asked to identify only the vowel in each test token; a second
group was asked to identify the two consonants as well as the vowel. The average
error in identifying the vowels was 22 percent for the first group and 29 percent
for the second. Both error rates are well below the 42 percent error rate
obtained on the variable-talker test with isolated vowels. In other words, even
when listeners do not know the identity of either the consonants or the vowel,
recognition is significantly more accurate for medial vowels than for isolated
vowels.

A second possible objection to the earlier tests with medial vowels might 
be that syllables spoken in isolation (in "citation form") are unrepresentative 
of the syllables found in rapid, connected speech. The medial vowels in rapidly 
spoken syllables might be at least as difficult to identify as isolated vowels, 
since the vocalic portions of such syllables often fail to reach the steady-


11 

The formant space is less compressed for isolated vowels than for medial
els. Thus, if the values of static first and second formants were the primary 
carriers of vowel quality, isolated vowels should be better perceived because 
their acoustic values are more widely separated. 
state values characteristic of syllables spoken in citation form. The study
reported earlier in this paper bears directly on this question. When /p-p/ syllables
spoken in unstressed position were excised from a carrier sentence and assembled
into a listening test, listeners made an average of 24 percent errors in
perceiving the medial vowels. This is not much greater than the 17 percent error
rate for perception of /p-p/ syllables read from a list, but is substantially
less than the 42 percent error rate for isolated vowels. One might have guessed
that the brevity of the short /p-p/ syllables and their failure to reach
steady-state values would make them more difficult to identify than isolated vowels,
which are longer in duration and more stable acoustically. Apparently, the
presence of a consonantal environment more than compensates for these
difficulties.

CONCLUDING REMARKS ON THE PROBLEM OF VOWEL CONSTANCY

Let us consider what we have learned about how the perceiver might achieve 
constancy of vowel quality. In our studies of vowel perception, the objective 
was to isolate sources of vocalic information in the natural speech signal. We 
employed signals that presented as many characteristics as possible of normal
conversational speech, including a representative range of signal variations 
that result from physical differences among talkers. 

Each way of conceptualizing the vowel contains an implied solution to the
problem of perceptual constancy. We first considered the assumption that the
vowel can be characterized by a steady-state output of the vocal tract, and
that, to a first approximation, fixed-formant loci are associated with each
vowel quality of all speakers. To the extent that this assumption is correct,
the constancy problem is trivial. Only minor adjustments for variation would be
required.

We saw that this conception of the vowel as a simple acoustic event,
segmented in time and in spectral frequency composition, was widely shared among
students of speech, including those who initiated earlier attempts at automatic
speech recognition. We have reviewed a number of findings that are incompatible
with this view. First, steady states are the exception, not the rule, in
continuous natural speech. As a result of coarticulation of vowels with preceding
and following consonants, the syllable is not discretely partitioned, and the
information for the vowel is smeared throughout the syllable. Moreover, the
variability occasioned by the phonemic environment of a vowel is compounded by
the changes that accompany different speaking rates and different vocal-tract
sizes. In retrospect, it is easy to see why attempts to design a generally
useful speech recognition machine have so far failed.

A more sophisticated conception of the vowel acknowledges the problem of
variability but continues to assume that vowels, even in running speech, can be
perceived with reference to a single set of acoustic values. One view proposes
that tokens of the "same" vowel fall on a line in vowel space defined by the
first and second formants. The formant frequencies of two talkers' vowels would
then be constant multiples of one another. We noted that this relationship
could not literally hold, because vocal tracts differ in shape as well as in
size. This rules out an analogy to a melody played in a different key, or to a
magnetic tape recording played back at a different speed. The failure of these
analogies is revealed by Peterson and Barney's (1952) measurements of first- and
second-formant frequencies in men, women, and children. The results of the
measurements (displayed in Figure 2) showed wide dispersion of formant values
for different speakers with considerable overlap of formants for neighboring
vowels. Even when one considers only those tokens on which perfect agreement
was obtained by listeners, much scatter among formant values is observed (cf.
Peterson and Barney's Figure 9).
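
What the constant-multiple hypothesis demands can be stated as a simple check:
every formant of every vowel for one talker must stand in the same ratio to the
corresponding formant of another talker. The sketch below is a minimal
illustration under that reading. The function names, the tolerance, and all
formant values are invented for the example and are not measurements from any
study cited here.

```python
def scale_factors(talker_a, talker_b):
    """Return the ratio talker_b / talker_a for every formant of every vowel."""
    ratios = []
    for vowel, formants_a in talker_a.items():
        formants_b = talker_b[vowel]
        ratios.extend(fb / fa for fa, fb in zip(formants_a, formants_b))
    return ratios


def is_uniform_scaling(talker_a, talker_b, tolerance=0.02):
    """True if all ratios agree to within the given relative tolerance."""
    ratios = scale_factors(talker_a, talker_b)
    return (max(ratios) - min(ratios)) / min(ratios) <= tolerance


# Made-up (F1, F2) values in Hz, chosen only to exercise the functions.
talker_a = {"i": (300.0, 2300.0), "a": (700.0, 1100.0), "u": (300.0, 900.0)}
# A hypothetical talker whose every formant is 1.2 times talker_a's
# (the tape played back at a different speed).
scaled = {v: (1.2 * f1, 1.2 * f2) for v, (f1, f2) in talker_a.items()}
# A hypothetical talker whose F1 and F2 are stretched by different amounts
# (a vocal tract that differs in shape as well as in size).
reshaped = {v: (1.1 * f1, 1.3 * f2) for v, (f1, f2) in talker_a.items()}

print(is_uniform_scaling(talker_a, scaled))    # True
print(is_uniform_scaling(talker_a, reshaped))  # False
```

The second case is the one the argument above turns on: once the first and
second formants are scaled by different factors, no single multiplier relates
the two talkers.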

Failing to find the invariant relation preserved by linear scaling,
investigators have sought a transformation that might yield a closer approximation.
For example, it has become an accepted practice to plot units of frequency (Hz)
on a scale of mels (Peterson, 1961; Ladefoged, 1967).¹² Transformation of
formant frequencies to mels might be defended on the grounds that this unit
reflects the response of the auditory system to frequency. However, we are
skeptical that the constancy problem can be illuminated by a search for the right
scale factor. Ladefoged (1967), who attempted to reduce variability by
employing phoneticians as talkers, concluded that separation of all vowels cannot be
attained by scaling the first- and second-formant frequencies, whether in linear
fashion or nonlinearly, as on a scale of mels.
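
For readers who wish to try such a transformation, the following sketch converts
between Hz and mels. The text gives no formula; the one used here,
2595 log10(1 + f/700), is a widely quoted analytic approximation and is an
assumption on our part, not the Stevens and Volkmann (1940) data themselves. The
function names are likewise illustrative.

```python
import math


def hz_to_mel(frequency_hz):
    """Approximate mel value for a frequency in Hz (assumed analytic formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel under the same assumed formula."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The scale is compressive: doubling the frequency does not double the mel value.
for frequency in (500.0, 1000.0, 2000.0, 4000.0):
    print(frequency, round(hz_to_mel(frequency), 1))
```

Because the transformation is monotonic, it changes the spacing of points along
each formant axis but cannot by itself pull apart tokens of different vowels
whose formant values overlap.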

Although no one has succeeded in demonstrating a generally applicable
scaling (normalizing) function, it is widely assumed that perceivers must apply such
a function to each new talker they encounter. There has been speculation about
the minimal stretch of speech required for calibration. Ladefoged's (1967)
application of adaptation-level theory to the problem of speaker normalization
reflects the common assumption that some extended sample of a new talker's
utterances is required for determining the weights that enable the normalizing
adjustments to be made. As we noted, Joos (1948) and Lieberman (1973) proposed
that ambiguity of a new talker's utterances can be resolved by reference vowels
that permit the perceiver to construct a model of the talker's vowel space,
scaling the input according to parameters derived from these calibration vowels.
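
One simple way to picture calibration of this kind is to rescale each formant to
the range spanned by the talker's own reference vowels. The sketch below is a
minimal illustration of that general idea only. It is not the published
procedure of Joos, Lieberman, or Gerstman; the function names are hypothetical,
and all formant values are invented.

```python
def calibrate(reference_vowels):
    """Derive per-formant (minimum, maximum) from a talker's reference vowels."""
    f1_values = [f1 for f1, _ in reference_vowels.values()]
    f2_values = [f2 for _, f2 in reference_vowels.values()]
    return (min(f1_values), max(f1_values)), (min(f2_values), max(f2_values))


def normalize(formants, calibration):
    """Rescale (F1, F2) to the talker's own range: 0 = minimum, 1 = maximum."""
    return tuple(
        (value - low) / (high - low)
        for value, (low, high) in zip(formants, calibration)
    )


# Made-up (F1, F2) values in Hz for one talker's reference vowels.
talker_a_refs = {"i": (300.0, 2300.0), "a": (700.0, 1100.0), "u": (320.0, 900.0)}
# A second, hypothetical talker whose formants are all 1.2 times higher.
talker_b_refs = {v: (1.2 * f1, 1.2 * f2) for v, (f1, f2) in talker_a_refs.items()}

# The same (invented) target vowel as produced by each talker.
target_a = (500.0, 1500.0)
target_b = (1.2 * 500.0, 1.2 * 1500.0)

print(normalize(target_a, calibrate(talker_a_refs)))
print(normalize(target_b, calibrate(talker_b_refs)))
```

The two printed pairs agree (up to rounding) because the second talker was
constructed by uniform scaling. A scheme of this kind presupposes that the
reference vowels have been heard before the target, which is the assumption the
precursor experiments were designed to test.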

Listeners can apparently adjust their criteria for perception of synthetic
vowels according to the formant ranges specified by a precursor sentence
(Ladefoged and Broadbent, 1957). The successful performance of Gerstman's
(1968) normalizing algorithm indicates that frequencies of the first and second
formants could, in principle, suffice for this purpose. However, we doubt, as
does Ladefoged (1967) himself, that first- and second-formant frequencies
exhaust the sources of information that specify a vowel in the natural speech
signal.¹³ Moreover, the fact that listeners can perceive randomly ordered
syllables accurately indicates that there is little need for a mechanism that
requires a sample of several syllables in order to construct a normalization
schema. Finally, in our own experiments with natural speech, we failed to find
that point vowel precursors, or another set of widely spaced vowels, brought
about a systematic improvement in recognition of the following vowel.

Our results do not, therefore, support the view that vowels are relational 
values in a metric space that must be scaled according to other vowels produced 



A mel is a psychophysical unit reflecting equal sense distances of pitch and 
bearing an approximately logarithmic relation to frequency for frequencies 
above 1000 Hz (Stevens and Volkmann, 1940). 

This is also Peterson's (1961) conclusion, based on studies of filtering. 
by the same talker. If that theory were correct, it is difficult to see how
precursors could fail to improve recognition of an immediately following medial
vowel. The presence of coarticulated consonants within the syllable proved far
more useful for categorizing natural vowels than prior experience with a
talker's utterance. In sum, our studies failed to provide supporting evidence for
current conceptions of the normalization process. They force us to consider
whether there is justification for a separate and preliminary normalization
stage in speech perception.

A major difficulty with all the proposals stated above is that they view
the invariance problem in terms of the relation among formant frequencies of
relatively sustained vowels. Even if such efforts to discover an algorithm were
successful, they would not be sufficient to explain perception of vowels in
natural conversational speech, because in such utterances a region of
steady-state energy is rarely present in the signal. The presence of sustained
acoustic energy at certain "target" frequencies is not essential for identification
of a vowel under natural listening conditions. On the contrary, there is
evidence that changing spectral patterns are much superior to sustained values as
carriers of vowel quality. We cited earlier reports that isolated steady-state
vowels are poorly perceived, even after listeners were given substantial
training and when the target vowels were spoken by phonetically trained talkers
(Fairbanks and Grubb, 1961; Lehiste and Meltzer, 1973). The results of our
perceptual studies definitely confirm the perceptual difficulty of isolated vowels.
Listeners misidentified 31 percent even when all items within a given test list
were produced by the same talker. Moreover, vowels coarticulated with
surrounding consonants, as is normal in running speech, were considerably more
intelligible than isolated vowels spoken by the same talkers (e.g., 9 percent of vowels
in /p-p/ environment were misidentified). It seems unlikely, therefore, that
the perceptual system operates by throwing away information contained in formant
transitions. Indeed, Lindblom and Studdert-Kennedy (1967), in studies with
synthetic speech, demonstrated that listeners use this information directly in
their placement of vowel phoneme boundaries. Vowel identifications varied with
direction and rate of transitions even when the formant frequency values at the
syllable centers were held constant. In short, it is futile to seek a solution
to the constancy problem by analysis of any acoustic cross section taken at a
single instant in time, and we must conclude that the vowel in natural speech is
inescapably a dynamic entity.

Heuristic procedures for automatic recognition of consonants often begin by
guessing the identity of the coarticulated vowel, since it is known that the
specific shapes of the formant transitions are conditioned by the vowel. The
vowel is assumed to be a stable reference point against which the identity of
the consonant may be determined. But this, of course, presupposes that the
vowel is more directly available than the consonant. We have now examined a number
of indications that the problem of perceptual constancy may be no less abstract
for the vowel than for the consonant. We found that isolated steady-state vowels



But see Summerfield and Haggard (1973). These investigators measured an in- 
crease in reaction time to synthetic syllables from different (simulated) 
vocal tracts, which they interpret as reflecting extra processing time re- 
quired for a normalization stage. 
provided especially poor perceptual targets. Moreover, there were not
substantially more errors in identifying the medial vowels of rapidly spoken syllables
(where steady states were presumably not attained), compared to errors on medial
vowels in syllables read from a list. It is clear that we can no longer think
of defining vowels in terms of acoustic energy at characteristic target
frequencies. A recognition device based on filters tuned to specific frequencies is as
unworkable for vowels as it is for consonants. If the idea of the vowel target
is to be retained, it must take account of the dynamic character of the syllable.

Lindblom's (1963) conception has this virtue. Formant contours are
characterized as exponential functions that tend toward asymptotic "target" values
associated with the vowel nucleus. Thus, a target can be defined acoustically
even though it corresponds to no spectral cross section through the syllable.
Our perceptual data presented here are compatible with this view that vowels are
specified by contours of moving formants with certain invariant properties over
stretches of approximately the length of a syllable.
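
The point that a target may be well defined without ever being reached can be
illustrated numerically. The sketch below assumes the simplest possible case, a
single formant moving from an onset value toward its asymptote along one
exponential; this form, the parameter values, and the function names are our
assumptions for illustration and are not Lindblom's (1963) equations.

```python
import math


def formant_contour(t_ms, onset_hz, target_hz, time_constant_ms):
    """Formant frequency at time t for an exponential approach to the target."""
    return target_hz + (onset_hz - target_hz) * math.exp(-t_ms / time_constant_ms)


def asymptote_from_samples(y0, y1, y2):
    """Recover the asymptote of an exponential from three equally spaced samples."""
    return (y0 * y2 - y1 * y1) / (y0 + y2 - 2.0 * y1)


# Hypothetical values: onset 1200 Hz, target 2000 Hz, time constant 40 ms,
# sampled at 0, 20, and 40 ms (a syllable too short to reach the target).
samples = [formant_contour(t, 1200.0, 2000.0, 40.0) for t in (0.0, 20.0, 40.0)]

print(max(samples))                      # highest value reached, short of 2000 Hz
print(asymptote_from_samples(*samples))  # recovers the 2000 Hz target (to rounding)
```

No sample in the contour equals the target, yet the target is fully determined
by the shape of the contour over time. The recovery formula holds only for a
noise-free single exponential; it is offered as a demonstration of the logic,
not as a model of what the listener does.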

We may find a parallel to this conclusion in studies of speech production.
Investigation of the manner in which phonemes are joined in the syllable reveals
context-dependent relationships similar to those we have noted in the acoustic
signal. Lindblom's (1963) dynamic theory of vowel articulation is, in fact, an
attempt to explain how contextual influences on the acoustic and phonetic
properties of vowels are produced. According to this view, undershoot in running
speech is brought about by inertia in the response of the articulators to motor
excitations occurring in rapid temporal succession. Invariant neural events
corresponding to vowel targets thus fail to bring articulators to the positions
they assume when the vowel is produced in a sustained manner.

Lindblom's (1963) inference of articulatory undershoot during rapid speech
has been confirmed by cinefluorographic data (Gay, 1974). His account of the
mechanism of undershoot has not gone unchallenged, however. MacNeilage (1970)
presented a different view of the inherent variability of speech production.
Basing his conclusions on extensive electromyographic studies of context effects
in articulation, he argued that variability of muscle contraction is not to be
understood merely as an unfortunate consequence of mechanical constraints on
articulator motion, but as necessarily built into the system in order to permit
attainment of relatively invariant target shapes. Gay (1974) cited his
cinefluorographic findings in support of MacNeilage's hypothesis that variability of
gesture must be regarded as a design characteristic. However, unlike MacNeilage,



Our understanding of the vowel has been influenced by Gibson's (1966) approach
to the problems of event constancy in visual perception, which is to seek
regularities in the stimulus pattern that can only be defined over time. A
similar approach is taken by Shaw, McIntyre, and Mace (1974), and by Shaw and
Pittenger (in press). We tend to agree with these authors that the dynamic
invariants specifying an event may be perceived directly by perceptual systems
that are appropriately tuned. While Lindblom (1963) and Lindblom and
Studdert-Kennedy (1967) offer a dynamic characterization of vowels, they apparently
made the usual assumption that only temporal cross sections can be directly
perceived, and they supposed that vowel perception is mediated by a process of
analysis-by-synthesis in which the dynamic invariants are used to compute
possible input patterns.
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he concluded that there was variability not only of gesture, but of spatial tar- 
get, since he failed to find invariance of vocal-tract shape for a central vowel 
/&/ that held across speakirig rates and consonantal environments. The same con- 
clusion was drawn by Nopteboom (1970),, who aifgued that the_ kinds of reorganiza- 
tion that occur in talking with the teeth clenched make it difficult to retain ^ 
the idea of invariant spatial targets. Just as the attainment of a specific 
acoustic target value is not necessary for the successful perception of a vowel, 
it is probably the case that the attainment of a specific target shape is not 
necessary for its effective production. Thus, in production as in perception, it 
has become increasingly difficult to entertain the notion of* an invariant target 
for each vowel, as long as the meaning of invariknce is restricted to a specific 
vocal-tract shape and its resonances. It is likely that the units of production, 
like the units of perception, cannot be defined independently of the temporal 
dimension. Speech, viewed either as motor gesture or as acoustic signal, is not 
a succession of static states. Invariance in vowel productioil, then, can be 
"discovered only in the context of the dynamic configuration of the syJ.lable. 

The reader might wonder at this point whether the various productions and
acoustic forms of a vowel are so heterogeneous that no coherent physical
definition (however abstract) could be found that embraces them all. Perhaps the
required invariance is not to be found in the acoustic signal at all. If it is
not, a radical solution to the constancy problem is to suppose that the variants
of a phoneme are physically unrelated, and to assume that the brain stores
separately a prototype of each vowel and consonant for every phonemic environment.
If we can extend to speech perception an argument made by Wickelgren (1969)
concerning its production, then Wickelgren's hypothesis of "context-sensitive
allophones" is such a proposal. However, the proposal has little to recommend
it. Halwes and Jenkins (1971) find a number of flaws, two of which are critical.
First, the proposal fails to capture the phonological relations that are known
to be important in understanding both the production and perception of speech.
Second, it ignores the "creativity" inherent in the production of speech that
permits the reorganization of articulatory movements to maintain intelligibility
even when normal speech movements are blocked, as when talking with the teeth
clenched or with food in the mouth, or when under the influence of oral
anesthesia.

In light of the evidence we have surveyed, it is obvious that attempts to
understand the psychophysical constancy relations in speech have failed to
discover transparent isomorphisms between signal and perception. This failure has
led many to doubt whether a psychophysics of speech could ever illuminate the
constancy problem. But certainly, it does not follow from the complexity of the
psychophysical relation that the signal fails to specify the phonemic message
uniquely. The emphasis that current theories place on the relational nature
of the vowel is misleading because it underestimates the richness of the signal
in natural speech, a richness that is attested by the great tolerance of the
perceptual system for a degraded speech signal (as in noisy environments or
after filtering). To abandon the search for acoustic invariants because the
psychophysical relations are complex would surely be a backward step. It should
be appreciated, however, that commitment to the principle of invariance does not
bind us to a literal isomorphism between signal and percept. The weight of
evidence conclusively opposes a one-to-one mapping of perceptual segments and
their dimensions on physical segments and their dimensions. In the case of
vowels, we have argued that the invariants cannot be found in a temporal cross
section but can only be specified over time. For vowels, as for other
phonological segments, a major goal of research is to discover the appropriate
time domains over which invariance might be found.

We doubt that a "distinctive feature" description of speech would allow a
simpler psychophysical relation to be stated. Phonemes are often characterized
by a set of component features that are the basis for contrastive phoneme
pairs. For example, /b/ and /d/ contrast in place of articulation, while /d/
and /t/ contrast in voicing. There is substantial evidence that such features
are integral to the perceptual analysis of speech. We believe that the same
arguments apply to the detection of distinctive features as apply to the

"Coperception": A Preliminary Study
Bruno H. Repp*



ABSTRACT 



The present paper defines "coperception," in analogy to coarticulation, as
the influence of one segment on the perception of another segment in an
utterance. The appropriate measure for this influence is reaction time, in one
of several possible tasks (classification, "same-different" judgments,
monitoring). The factors that may govern coperception are discussed in terms of
three (not mutually exclusive) hypotheses: (1) temporal integration,
(2) perception-production correlations, and (3) perceptual units. A preliminary
study is reported that demonstrates coperception of a stop consonant with the
final vowel in VCV utterances, although the former was partially independent of
the latter at the acoustical level. "Same-different" judgments about the
consonants in two successive VCVs were influenced by the similarity between the
final vowels. This result extends similar findings obtained by others in CV
syllables. Some methodological issues in the investigation of coperception are
discussed.



INTRODUCTION 


Pisoni and Tash (1974) demonstrated that, at the level of "same-different"
judgments, the consonant and the vowel in CV (stop-vowel) syllables are not
perceptually independent of each other. When the consonants in two successive CV
syllables were to be compared, "same" judgments were faster when the vowels were
also the same (/ba/-/ba/) than when they were different (/ba/-/bæ/); and
"different" judgments were faster when the vowels were different (/ba/-/dæ/) than
when they were the same (/ba/-/da/). Similar effects were exerted by the
consonants when the identity of the vowels was to be judged. Wood and Day (1975)
obtained precisely the same results in a different but related paradigm, speeded
classification; both consonant and vowel classifications (binary-choice reaction
times) were faster when the irrelevant phoneme was held constant than when
it varied randomly.

*University of Connecticut Health Center, Farmington.

Clearly, neither "same-different" judgments nor classification responses
are based exclusively on abstract, independent phonemic codes. These codes,
which are subjectively so salient, are presumably higher cognitive constructs
that are not directly accessed in rapid perceptual judgments (cf. also Savin and
Bever, 1970). The phonemic codes in immediate perception, while permitting
error-free judgments, seem to remain "context-sensitive" (Wickelgren, 1963) by
the criterion of response latency. The characteristics and limits of this
context sensitivity are worth exploring.

The stimuli used by Pisoni and Tash (1974) and by Wood and Day (1975)
represented optimal conditions for context sensitivity to emerge. In stop-vowel
syllables, the phonemes are not only contiguous but also in part acoustically
dependent on each other: the formant transitions transmit information about the
consonant and the vowel in parallel. Context sensitivity as a consequence of
parallel information transmission at the acoustic level is not a greatly
surprising finding. More interesting cases arise when the phonemes in question are
independent at the acoustic level or even separated by silence or intervening
information. Pisoni and Tash (1974), at least, seem to expect perceptual
independence in this case: "if the information were not transmitted in parallel
form we would not expect differences in consonants to affect the vowel decision
and differences in vowels to affect the consonant decision" (p. 134). However,
this conception may be too narrow. Before I proceed to some empirical data,
let me propose a new term.

Because of its analogy to the articulatory phenomenon of coarticulation,
the perceptual phenomenon under investigation will be termed "coperception."
Coperception exists whenever the perception (in terms of the speed or the
accuracy of certain judgments) of a particular portion or segment of the speech
signal is influenced by a preceding or following portion or segment.
Coarticulation, of course, is defined as the influence of one segment on the
articulation of another segment (Daniloff and Hammarberg, 1973). Coarticulation
may extend over several (acoustic or phonetic) segments, and the same may well be
true of coperception. Both phenomena have limits in terms of a maximal time
interval or number of segments over which they can extend. Both may work in a
forward (left-to-right) and/or in a backward (right-to-left) direction. While
left-to-right coarticulation may involve carryover and inertia within the
articulatory system, in addition to centrally planned components, right-to-left
coarticulation reflects only articulatory planning and anticipation (Daniloff and
Hammarberg, 1973) and therefore is perhaps the more interesting effect of the
two. Similarly, right-to-left coperception is probably more interesting (or
less confounded) than left-to-right coperception. In the latter, the contextual
information has already entered the system and may affect subsequent judgments
simply because of its presence in some short-term store, or because it is still
being processed, or because it has "preset" some relevant response mechanism or
criterion; in short, because of its "perceptual inertia." In right-to-left
coperception, on the other hand, the biasing context follows the segment to be
judged, so that contextual effects will reflect only the size of the perceptual
segment, defined as a particular time span or as a particular number or
constellation of segments, over which the speech perception mechanism integrates
before a decision ("same-different" or classification) can be reached.

As can be seen, the formal analogy between coarticulation and coperception
is nearly complete. Whether there is any functional analogy, in terms of the
constraints involved or in terms of more specific relationships between adjacent
sounds, remains to be investigated. The most obvious question to ask is whether
coperception can be found in the absence of coarticulation in the same signal,
or vice versa. This may be investigated by examining the coperception of
segments that do not show coarticulation, or by looking for evidence of
coarticulation between segments whose perceptual independence has been
established. The present study had a more modest goal: by using synthetic speech,
coarticulation effects were partially removed from utterances that tend to show
such effects in natural speech. The question was whether perception would take
advantage of this opportunity and show independence, or whether coperception
would nevertheless be found, which could then be explained either by reference to
articulatory constraints or to more general temporal limits of perception.

More specifically, the present study used the "same-different" paradigm
with stimuli that went just one step beyond the CV syllables employed by Pisoni
and Tash (1974). Vowel-consonant-vowel utterances were synthesized such that
the acoustic segment preceding the stop closure was invariant with respect to the
final vowel. Since this acoustic segment, which is perceived in isolation as a VC
syllable, is sufficient for recognition of the stop consonant, the speech
perception mechanism was offered an opportunity to show independence of the final
vowel in judgments of the consonant. On the other hand, Ohman (1966) has shown
that, in natural speech, the implosive transitions (which precede the closure)
do vary with the final vowel to some degree. The explosive transitions (which
follow the closure) were dependent on the final vowel, of course, and
corresponded to those of a CV syllable. If the perceptual mechanism integrates
implosive and explosive transitions across the closure (50 msec of silence),
coperception should occur. If, on the other hand, a response is initiated as soon
as there is minimal information present to reach a decision, perceptual
independence may be found. As an additional aspect, the present study used not two
but four different final vowels, in order to investigate the effects that their
mere similarity may have on "same-different" judgments.

METHOD 

Subjects 

Twelve female University of Connecticut undergraduates received course
credit for their participation.

Stimuli

Eight VCV utterances, /aba/, /abæ/, /abe/, /abi/, and /ada/, /adæ/, /ade/,
/adi/, were synthesized by rule on a Glace-Holmes synthesizer at the University
of Connecticut (Storrs). All stimuli began with 90 msec of steady-state /a/,
followed by 50-msec transitions appropriate for a final /-b/ or /-d/, which did
not depend on the final vowel. These implosive transitions were followed by 50
msec of silence, then 50 msec of explosive transitions (dependent on the final
vowel), and finally one of four steady-state vowels of 100-msec duration. The
total VCV duration was 340 msec. The final vowel /-e/ was mistakenly
synthesized to be only of 40-msec duration, so that these two utterances lasted
only 280 msec. (The results showed little effect of this factor; see below.)
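
The stated total durations follow directly from the segment durations. The
short sketch below only restates that arithmetic; the names (SEGMENTS_MSEC,
vcv_duration, CLOSURE_ONSET_MSEC) are illustrative and are not part of the
original synthesis procedure.

```python
# Segment durations (msec) as given in the text; all names are illustrative.
SEGMENTS_MSEC = {
    "initial_vowel": 90,          # steady-state /a/
    "implosive_transitions": 50,  # for /-b/ or /-d/, independent of final vowel
    "closure_silence": 50,
    "explosive_transitions": 50,  # dependent on the final vowel
}


def vcv_duration(final_vowel_msec: int = 100) -> int:
    """Total VCV duration for a given steady-state final vowel duration."""
    return sum(SEGMENTS_MSEC.values()) + final_vowel_msec


# Onset of the silent closure, measured from VCV onset.
CLOSURE_ONSET_MSEC = (
    SEGMENTS_MSEC["initial_vowel"] + SEGMENTS_MSEC["implosive_transitions"]
)

print(vcv_duration())    # intended 100-msec final vowel
print(vcv_duration(40))  # the mistakenly shortened /-e/
```

The closure onset is included because the reaction times reported later are
referred to that point in the stimulus.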

The experimental tape was recorded with the help of the pulse-code-modulation
(PCM) system at Haskins Laboratories. The tape contained first a random
list of 80 single VCVs, 10 replications of each individual stimulus. Each VCV
was preceded by a 100-msec warning tone (a synthetic nonspeech signal), which
came on 500 msec prior to VCV onset. The interstimulus interval (between VCV
offset and the onset of the next warning tone) was 4 sec. This practice series
was followed by five blocks of 64 randomized test pairs (i.e., five replications
of all possible pairings of the eight stimuli). The onset-onset interval between
the two VCVs in a pair was 490 msec. The first VCV was again preceded by a
warning tone, and the interpair interval was 4 sec.
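
The structure of the test blocks can be made concrete with a small sketch. It
assumes that "all possible pairings" means every ordered pair of the eight
stimuli, including a stimulus paired with itself, which gives the stated 64
pairs per block. The vowel spellings, the function name, and the seeded shuffle
are illustrative; the randomization actually used for the tape is not described.

```python
import itertools
import random

CONSONANTS = ["b", "d"]
VOWELS = ["a", "ae", "e", "i"]  # ASCII stand-ins for the four final vowels

# The eight VCV stimuli, e.g. "aba", "abae", ..., "adi".
STIMULI = [f"a{c}{v}" for c in CONSONANTS for v in VOWELS]


def make_block(rng: random.Random) -> list[tuple[str, str]]:
    """One block: every ordered pairing of the eight stimuli, shuffled."""
    pairs = list(itertools.product(STIMULI, repeat=2))
    rng.shuffle(pairs)
    return pairs


rng = random.Random(0)  # fixed seed only so that the sketch is reproducible
blocks = [make_block(rng) for _ in range(5)]

# Pairs whose medial consonants match call for a "same" response.
n_same = sum(first[1] == second[1] for first, second in blocks[0])
print(len(STIMULI), len(blocks[0]), n_same)
```

Under this reading, half of the pairs in each block require a "same" response
and half a "different" response.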



The subjects were tested individually in two sessions on different days
lasting approximately 1 1/2 hours altogether. The subject listened over
Grason-Stadler (Telephonics) TDH-39 earphones and operated a toggle switch with
her preferred hand. After the toggle switch and the nature of the stimuli were
explained, the subject listened to the practice series, with the instructions to
classify the consonant in each VCV as rapidly and as accurately as possible by
moving the toggle switch in the appropriate direction. The two positions of the
switch (toward and away from the body) were labeled "B" and "D"; the assignment
of these labels to the positions was counterbalanced over subjects.
Subsequently, the five blocks of pairs were presented (with rest pauses between
blocks, as required). The labels were changed to "SAME" and "DIFF" (likewise
counterbalanced over subjects), and the subject was instructed to judge whether
the consonants in the two VCVs in a pair were the same or different as rapidly
and as accurately as possible. The variation in the final vowel was to be
ignored. In the second session, the subject repeated the five experimental blocks.


Because of equipment malfunction, half of the subjects listened to the
stimuli monaurally; some blocks were presented to the left ear and some to the
right ear. The remaining subjects listened binaurally, as intended. The tapes were
played back on a high-quality tape recorder (Crown 860) and passed through a
Lafayette La-375 solid-state amplifier. The intensity was set at a comfortable
listening level. Reaction times were recorded by a Hunter M^G 1521 digital
millisecond timer. The timer was triggered by a Grason-Stadler E7300A-1
voice-operated relay key, which in turn was activated by the onset of the warning
tone on the tape. The timer was manually reset by the experimenter after she
recorded the reaction time.

Although reaction times were originally measured from the onset of the
warning tone, a constant was subsequently subtracted, such that the reaction times
were measured from the offset of the implosive transitions (the onset of the
silent closure period) of the second VCV in a pair (or of the single VCV in the
classification task). Median reaction times were calculated for the ten
replications of each practice stimulus, and for the five replications of each
individual experimental pair in each session, omitting errors. These medians
formed the basic data for further analysis, which was in terms of averages.
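
The correction and the median computation can be sketched as follows. The
subtracted constant is not reported; the two values below are inferred from the
timing figures given under Stimuli (500 msec from tone onset to VCV onset, 490
msec onset-to-onset within a pair, and 90 + 50 msec from VCV onset to the
closure) and should be read as an assumption. The function name and the example
trials are hypothetical.

```python
from statistics import median

TONE_TO_VCV_ONSET = 500     # msec, warning-tone onset to (first) VCV onset
VCV_ONSET_TO_CLOSURE = 140  # msec, 90 steady state + 50 implosive transitions
PAIR_ONSET_ONSET = 490      # msec between the two VCV onsets in a pair

# Inferred constants (an assumption; the text does not state them).
OFFSET_SINGLE = TONE_TO_VCV_ONSET + VCV_ONSET_TO_CLOSURE
OFFSET_PAIR = TONE_TO_VCV_ONSET + PAIR_ONSET_ONSET + VCV_ONSET_TO_CLOSURE


def median_correct_rt(trials, offset):
    """Median corrected RT over correct trials only.

    `trials` is a list of (raw_rt_msec, is_correct) tuples, with raw RTs
    timed from warning-tone onset. Returns None if no trial was correct.
    """
    corrected = [rt - offset for rt, ok in trials if ok]
    return median(corrected) if corrected else None


# Hypothetical raw data: five replications of one pair, one of them an error.
example = [(1530, True), (1610, True), (1490, False), (1575, True), (1548, True)]
print(median_correct_rt(example, OFFSET_PAIR))
```

With an even number of correct trials, statistics.median returns the mean of
the two middle values; whether the original analysis did the same is not stated.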



Classification 

Table 1 shows the average median reaction times for classifying the medial
consonants as "B" or "D." It can be seen that the reaction times for "D" were
slightly longer than those for "B," but there was a really substantial difference
only in one context, /-i/: /abi/ had the fastest and /adi/ the slowest
latencies of all stimuli. This specific interaction reached significance (p<.03).



TABLE 1: Average median classification reaction times (msec) and error
         percentages (in parentheses) for /ab-/ ("B") and /ad-/ ("D")
         in four different contexts.

              /-a/     /-æ/     /-e/     /-i/     Mean

  /ab-/        331      334      336      322      331
              (4.2)    (3.4)    (0.8)    (6.7)    (3.8)

  /ad-/        334      337      346      377      349
              (1.7)    (1.7)    (5.8)    (3.4)    (3.2)

The error rates are also shown in Table 1. No stimulus caused particular
difficulties and all VCVs were highly identifiable, even though they were
synthetic. The differences in error rates between individual stimuli
were almost certainly random.

"Same-Different" Judgments 

To simplify the statistical and graphical analysis of the data, a
"similarity" dimension was defined on the final vowels. Pairs of vowels that are
neighbors in the classical vowel triangle (or quadrilateral) were considered
"similar," while other pairs were considered "dissimilar." The similar pairs were
/-a/+/-æ/, /-æ/+/-e/, and /-e/+/-i/; the dissimilar pairs were /-a/+/-e/,
/-a/+/-i/, and /-æ/+/-i/. The median reaction times were subjected to a four-way
analysis of variance (Bock's, 1975, pseudo-multivariate method for repeated
measurements), with the factors: (a) Type of Response ("same" versus
"different"), (b) Vowel Context (identical, similar, dissimilar), (c) Consonant of
First VCV (B versus D), and (d) Sessions (first versus second). The reason for
the third factor will become evident below. In addition, three-way analyses of
variance were conducted on "same" and "different" latencies separately.
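
The three-level Vowel Context factor can be written as a small function. The
sketch assumes that the four vowels are ordered a, æ, e, i along one side of the
vowel triangle, so that "neighbors" are adjacent positions in that order; this
encoding is an illustration that reproduces the six pairings listed above, not
a procedure taken from the study.

```python
# ASCII stand-ins for the four final vowels, ordered along the vowel triangle.
VOWEL_ORDER = ["a", "ae", "e", "i"]


def vowel_context(v1: str, v2: str) -> str:
    """Classify a pair of final vowels as identical, similar, or dissimilar."""
    distance = abs(VOWEL_ORDER.index(v1) - VOWEL_ORDER.index(v2))
    if distance == 0:
        return "identical"
    return "similar" if distance == 1 else "dissimilar"


for pair in [("a", "a"), ("a", "ae"), ("ae", "e"), ("e", "i"),
             ("a", "e"), ("a", "i"), ("ae", "i")]:
    print(pair, vowel_context(*pair))
```

The classification is symmetric, so the order of the two vowels within a pair
does not matter.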

The results are shown in Figure 1. "Same" latencies were significantly
faster than "different" latencies (p<.002), and there was also a significant main
effect of Vowel Context (p<.03). Both results make little sense, however, in
view of the striking interaction between the two factors (p<.0004): "same"
latencies were much faster than "different" latencies when the final vowels were
identical, somewhat faster when the vowels were similar, and a bit slower when
the vowels were dissimilar. (The number of individual subjects showing faster
"same" latencies, by at least 10 msec, in the three contexts was 12, 9, and 2,
respectively.) The two component contrasts of the interaction, which contrasted
identical with nonidentical vowels and similar with dissimilar vowels,
respectively, were both significant (p<.0001 and p<.002, respectively).

The vowel context had a striking effect on "same" latencies (p<.002), while
its effect on "different" latencies was less pronounced (p<.05) and in the
opposite direction. "Same" latencies were significantly faster in identical
contexts than in nonidentical contexts (p<.0004, shown by all 12 subjects),
while "different" latencies were significantly slowed down in identical contexts
(p<.02, shown by 8 subjects). Vowel similarity, on the other hand, affected
only "same" responses, which were much faster in similar contexts (p<.0009,
shown by 10 subjects); "different" responses showed no difference at all.
However, four subjects surprisingly showed faster "different" latencies in similar
contexts than in dissimilar contexts, while three subjects showed a difference
in the opposite direction.

[Figure 1 plot: reaction time, 350-500 (ordinate), against vowel context
(SAME, SIM., DISSIM.; abscissa); legend: B-B, D-D, B-D, D-B.]

Figure 1: Average median reaction times (RT) of "same" and "different" judgments
          in three different classes of vowel context. The small symbols
          represent particular sequences of the consonants being judged.

Figure 1 shows further that the nature of the consonant in the first VCV
exerted a large effect on the response times: they were much slower when the
first consonant was a D than when it was a B. Because of high variability, the
significance level of the effect was not very high (p<.02). The difference was
only marginally significant for "same" latencies alone (p<.06), although it was
in the same direction for all subjects; it was more consistent for "different"
latencies (p<.007), owing to enormously large differences for some
subjects (but two subjects showed an inverted effect). The nature of the second
consonant apparently had no influence on the reaction times (cf. Table 2).

Finally, there was also a significant decrease of reaction times with
practice, i.e., from the first to the second session (p<[?]). However, there were
no substantial changes in the pattern of results. The effect of identical
contexts on the speed of "same" responses was somewhat less pronounced in the
second session (p<.04) but still very striking. The same was true with respect
to the effect of the first consonant (p<.05).

The classification of the vowels into similar and dissimilar pairs may have
been somewhat arbitrary and was partially confounded with the individual vowels.
Also, one of the vowels, /-e/, had a shorter duration than the others.
Therefore, the detailed results are shown in Table 2.



TABLE 2: Average median latencies (msec) for "same" (upper-left and lower-right
         quadrants) and "different" (upper-right and lower-left quadrants)
         judgments. Latencies for identical vowel contexts are underlined.

                                      Second VCV
                          /ab-/                        /ad-/
  First VCV       /-a/  /-æ/  /-e/  /-i/       /-a/  /-æ/  /-e/  /-i/     Mean

  /ab-/  /-a/      332   408   404   451        399   383   381   408      396
         /-æ/      385   393   369   447        389   416   411   430      405
         /-e/      418   383   301   391        454   432   472   428      410
         /-i/      405   454   414   327        439   440   425   434      417

  /ad-/  /-a/      509   469   464   [?]        399   424   453   462      457
         /-æ/      466   528   460   467        409   408   410   457      451
         /-e/      464   494   486   453        456   420   354   443      446
         /-i/      439   432   463   474        437   524   465   356      449

  Mean             [?]   445   420   435        423   431   421   427      429

There are only a few atypical results at the level of individual stimulus
pairs. For example, in /-æ/ context, "same" latencies were not faster when the
vowels were identical; there is no obvious explanation for this result. The
fastest "same" latencies and, in one quadrant, the slowest "different" latencies
occurred in identical /-e/ contexts. This is interesting, since it may reflect
the additional effect of final vowel duration, a nonphonetic variable. There
were no effects reflecting the different classification times observed in /-i/
context during the practice series.
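
The "same" quadrants of Table 2 can be collapsed into the three vowel-context
classes to show how the cell entries relate to the pattern described for
Figure 1. The sketch below averages the tabled cell values directly, which is
only a rough summary and not the subject-level analysis of variance reported
above; the similarity rule (adjacent vowels in the order a, æ, e, i) and all
names are illustrative assumptions.

```python
# "Same" quadrants of Table 2 (rows: first VCV, columns: second VCV), in msec.
# Row and column order of the final vowels: a, ae, e, i.
SAME_BB = [[332, 408, 404, 451],
           [385, 393, 369, 447],
           [418, 383, 301, 391],
           [405, 454, 414, 327]]
SAME_DD = [[399, 424, 453, 462],
           [409, 408, 410, 457],
           [456, 420, 354, 443],
           [437, 524, 465, 356]]


def context_means(quadrant):
    """Mean latency per vowel context, using index distance as similarity."""
    groups = {"identical": [], "similar": [], "dissimilar": []}
    for i, row in enumerate(quadrant):
        for j, rt in enumerate(row):
            d = abs(i - j)
            key = "identical" if d == 0 else "similar" if d == 1 else "dissimilar"
            groups[key].append(rt)
    return {k: sum(v) / len(v) for k, v in groups.items()}


for name, quad in (("B-B", SAME_BB), ("D-D", SAME_DD)):
    print(name, context_means(quad))
```

In both quadrants the mean "same" latency rises from identical through similar
to dissimilar contexts, in line with the effects reported in the text.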

The pattern of error percentages, when plotted as in Figure 1, was so
similar to that of the latencies that a separate figure is superfluous. In other
words, the error frequencies were highly correlated with the latencies of correct
responses and showed the same effects of vowel context, as well as the influence
of the first consonant. While it is well-known that the errors and the latencies
in "same-different" judgments are positively correlated, the degree of similarity
found here was surprising. The error percentages of the individual subjects
varied widely, from 0.8 to 17.2 percent, but eight of the subjects remained below
5.0 percent. Nevertheless, all subjects showed the same basic pattern of
reaction times, so that latencies and errors merely appear to be alternative
reflections of the same underlying processes.

DISCUSSION

The basic outcome of the present study is clear: in the context of a
"same-different" judgment task, the perception of the consonant in VCV utterances
is not independent of the final vowel, although there is a part of the acoustic
signal that is not dependent on the final vowel and sufficient for recognition
of the consonant. This is a clear case of right-to-left "coperception," as
defined in the Introduction, and somewhat less trivial than coperception in CV
syllables, where the acoustic cues for the stop consonant are largely dependent
on the following vowel.

There are several possible explanations for why coperception occurred in the
present situation; they are all interesting and not at all mutually exclusive:
(1) Temporal integration: One possibility is that the information that was
coperceived with the consonant occurred within the time span over which the
perceptual mechanism integrates, so that it could not possibly be ignored. This is
plausible, since vowel-independent and vowel-dependent transitions were
separated by merely 50 msec of silence. On the other hand, the integration or
temporal storage limits of the speech processor seem to be on the order of 250
msec, according to estimates from backward-masking paradigms (Massaro, 1972,
1974; Repp, 1975a); or perhaps only about half as long, according to estimates
from certain other dichotic masking studies (e.g., Pisoni and McNabb, 1974),
from dichotic discrimination errors (Repp, 1975b), or from the perception of
dichotic pulse trains (Huggins, 1974). Both estimates are sufficient to explain
the present results. The next step would be to try to exceed the limits of the
temporal integration period by increasing the duration of the silent closure
period. If it is sufficiently long (perhaps about 150 msec), geminate consonants
will be heard, i.e., /ab-ba/ instead of /aba/ (Dorman, Raphael, Liberman, and
Repp, 1975). It is a reasonable guess that coperception will disappear precisely
at that point, but this remains to be tested. If the hypothesis is confirmed, it
would provide support for Massaro's (1972) conjecture of a 250-msec integration
period in speech perception, since the duration of the transitions that are
integrated will have to be added to the silent interval to obtain the maximal
integration interval.
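
The arithmetic behind this last step can be spelled out. The sketch assumes
that the interval to be integrated runs from the onset of the implosive
transitions to the offset of the explosive transitions, each 50 msec long as in
the present stimuli; the function name is illustrative.

```python
IMPLOSIVE_MSEC = 50  # vowel-independent transitions before the closure
EXPLOSIVE_MSEC = 50  # vowel-dependent transitions after the closure


def integration_span(closure_msec: int) -> int:
    """Time from onset of implosive to offset of explosive transitions."""
    return IMPLOSIVE_MSEC + closure_msec + EXPLOSIVE_MSEC


print(integration_span(50))   # closure used in the present stimuli
print(integration_span(150))  # approximate closure at which geminates are heard
```

On these assumptions a closure of about 150 msec corresponds to a total span
of 250 msec, which is the integration period conjectured by Massaro (1972).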



However, there may be other factors at work than purely temporal ones.
(2) Perception-production correlations: One relevant finding is that, in natural
speech, the implosive transitions in VCVs show some degree of coarticulation with
the final vowel (Ohman, 1966). Since speech perception has often been shown to
take into account the dynamics and constraints of speech production, we may have
found another manifestation of this functional symbiosis. In other words,
coperception may occur whenever coarticulation normally occurs, even if there is
no actual evidence of coarticulation in the speech signal (because the synthetic
speech has been deliberately generated without it, as in the present study). It
will be difficult to produce unequivocal support for this hypothesis, but it may
be critically tested, for example, by examining whether coperception occurs
across synthetic consonant clusters followed by vowels that include /u/ which, in
normal speech, produces anticipatory lip-rounding throughout the whole cluster
(Kozhevnikov and Chistovich, 1965; Daniloff and Moll, 1968).

(3) Perceptual units: There is a third possible explanation for
coperception in VCVs, and it concerns perceptual preferences for certain
constellations of phonetic segments. Kozhevnikov and Chistovich (1965) have
postulated that the CV syllable forms the basic perceptual (and articulatory)
unit, so that a VCV utterance should be perceptually parsed into V-CV. Of course,
this is consistent with the present results. There are obvious ways in which this
perceptual unit hypothesis could be further tested. For example, it predicts that
there should be less or no coperception between the first vowel and the consonant
in VCVs when judgments about the consonant are to be made. This is an interesting
hypothesis to test, since the vowel-dependent transitions precede the
vowel-independent transitions in this case, so that the predicted outcome would be
somewhat counterintuitive. Another way of testing the influence of segmental
factors while holding temporal relationships constant would be to replace the
explosive transitions in a VCV with transitions appropriate for a different
consonant, e.g., /ab-da/. However, it is known that, in this case, the perception
of the first consonant would be impaired (Dorman et al., 1975), a result that in
itself is consistent with the result of coperception in the present VCVs and with
the perceptual unit hypothesis. In order to preserve the first consonant, the
closure interval would have to be extended to about 100 msec; but the question of
whether coperception occurs may then still be asked, also with regard to
perceptual interactions between the two consonants.

Coperceptual phenomena may be influenced by further variables not yet
mentioned. Suprasegmental factors, such as the relative stress, duration, or
intonation of the two vowels in a VCV utterance, may play a role. Syllable,
morpheme, or even syntactic boundaries may affect coperception. The kind of
instructions and the task may be important, too. For example, in speeded
classification (Wood and Day, 1975), the listeners may be comparing the stimuli
with more abstract codes in their brain than in the "same-different" judgment task
where the first stimulus in a pair must be encoded before it is compared with
the second stimulus. If the second stimulus occurs before the encoding of the
first is complete (and the process of phonemic abstraction may take more time
than is usually allowed), coperception in some form is likely to result. The
interval between the stimuli to be compared may therefore be an important
variable. Other important issues, such as that of the relative discriminability of
parts of the signal, have been discussed by Wood and Day (1975). Another related
paradigm that may be employed to investigate coperception is monitoring, the
rapid detection of phonemic targets in a series of signals that either contain the
target or do not. This paradigm has been primarily connected with discussions
of the perceptual unit hypothesis in the past, since it can also be easily
applied to larger-size units such as syllables or words (Savin and Bever, 1970;
Lehiste, 1972; McNeill and Lindig, 1973). Each of the three paradigms discussed
here may be applied to investigate coperception, but the specific demands of
each task will have to be studied carefully.

Besides demonstrating the basic phenomenon of coperception, the present
experiment yielded an interesting result that must be taken into account in any
model of "same-different" judgments: the mere similarity of the final vowels
influenced the speed of "same" judgments. (The influence on "different"
judgments was not consistent.) It is known that the speed of "different" judgments
increases with increasing dissimilarity of the stimuli being compared, and,
similarly, the speed of "same" judgments (including incorrect responses)
decreases with increasing dissimilarity of the stimuli (e.g., Repp, 1975b). Given
that a consonant and a vowel are coperceived, the similarity between the vowels
affects "same" judgments about the consonant. This rules out any discrete model
that admits only decisions about identity and nonidentity (Pisoni and Tash,
1974). Rather, the response latencies reflect an underlying perceptual
continuum, or at least a more fine-grained discrete representation such as feature
matrices. Why the "different" latencies were not affected by vowel similarity
is not obvious, and since different subjects showed differences in opposite
directions, no interpretation will be attempted here.

A truly puzzling finding is the influence of the first consonant in a pair
on both "same" and "different" reaction times. (At first, I suspected this
effect to be due to an error in data transcription, but this seems to have been
ruled out.) A difference due to the second consonant would have been far
more plausible, since the decision time includes the time to encode the second
stimulus. Of course, the situation is ambiguous: it may equally well represent
an interaction of the second consonant with the kind of judgment made, but this
by no means facilitates its interpretation. Moreover, there were no
differences of a similar extent in the classification latencies for single syllables.
This effect should definitely be replicated, since it still may be due to some
undiscovered artifact.

It has often been suggested that the syllable is the basic perceptual unit
(e.g., Savin and Bever, 1970; Massaro, 1972). However, since the syllable is
usually not precisely defined, this statement is somewhat circular. Linguistic
definitions of the syllable may not be directly relevant to perception, and
specific proposals, such as the primacy of the CV syllable (Kozhevnikov and
Chistovich, 1965), need to be further tested. Research on coperception provides
a discovery procedure for the units in speech perception; for example,
coperception (especially from right to left) should not occur across "syllable
boundaries." Systematic research on the limits of coperception promises to provide
important information about the processes involved in speech perception, and
perhaps about the perception of time-varying signals in general.


Dichotic "Masking" of Voice Onset Time*
Bruno H. Repp



ABSTRACT 

When consonant-vowel (CV) "targets" are presented to one ear and
isolated vowel "masks" to the other ear, the perception of the voicing
feature of the target is biased by the temporal relationship between
target and mask, which acts like a "pseudo voice onset time" and
competes with the actual voice onset time of the target. The effect is
especially strong when there is high a priori uncertainty about the
voicing category of the target, and it is more pronounced when the
masking vowel lags behind than when it leads in time. The bias is
stronger when the masking vowel is the same as the vowel of the
target, but it is present with a different vowel mask as well. There
tends to be a right-ear advantage that is strongest when the masking
vowel leads in time. The fundamental frequency at masking vowel onset
has an additional influence: the lower it is, the stronger is the
tendency to identify the target as voiced. Two simple additive models
of the vowel "masking" effect are rejected. The complexity of the
effect suggests dichotic interaction at the phonetic level. The
present experiments demonstrate that a temporal relationship may be
masked by another temporal relationship, and that the voice onset time
across ears may act as a phonetic cue.

INTRODUCTION 


In a study of the dichotic masking of consonants by vowels (Repp, 1975b)
three main results were reported, all of which were more or less unexpected and
were revealed only by a rather detailed breakdown of the data. The task
consisted of identifying consonant-vowel (CV) syllables (from the set: /ba/, /da/,
/ga/, /pa/, /ta/, /ka/) presented to the right ear while isolated vowels (/a/)
were simultaneously presented to the left ear at varying stimulus onset
asynchronies (SOAs) with respect to the CV syllables. The three results were:

(1) The vowel "masks" interfered primarily with the perception of voicing in the
opposite ear and only little with the perception of place of articulation.

(2) When the onset of the masking vowel preceded the onset of the target
syllable, there was an increased tendency to give voiced responses (/ba/, /da/, /ga/),
but when the vowel lagged behind, there was a rapid shift in the opposite
direction and voiceless responses dominated (/pa/, /ta/, /ka/).

(3) Of two vowel masks with different pitch contours, the vowel with the higher
fundamental frequency led to more voiceless responses than the one with the lower
pitch, and this effect seemed to be independent of the bias due to the temporal
relationship between target and mask.

One possible explanation proposed for these findings was that the auditory
information from the two ears is combined at a relatively early perceptual stage
and that the interfering vowel masks the voice onset time (VOT) of the target
syllable by substituting another voicing cue: the perceptual mechanism
mistakenly accepts the SOA between target syllable and masking vowel as the VOT of the
target syllable. (Naturally, the onset of the vowel mask implies the onset of
voicing, i.e., of fundamental frequency.)

The present two experiments attempt to replicate and extend the earlier 
fortuitous findings (Repp, 1975b) in a somewhat different, methodologically im- 
proved paradigm. One restrictive feature of the previous study was that the 
target syllables were presented at relatively low intensities (below asymptotic 
intelligibility) and that the observed effects of the masking vowel were pro- 
nounced only when it was at a higher intensity than the target syllables. In 
the two studies to be described here, all stimuli were presented at higher, and 
equal, intensities, but the VOT of the target syllables was systematically 
varied. By presenting target syllables from a VOT continuum (Lisker and
Abramson, 1967), the actual VOT of the target syllable was juxtaposed to the
"pseudo-VOT" presumably simulated by the SOA between target and mask. It was
expected that syllables with a VOT close to the category boundary, that is, with
high uncertainty about their phonetic category, would be affected most.

The effect of fundamental frequency was reinvestigated in the first experi- 
ment by using three different pitch contours for the masking vowels. In addition 
to assessing the effects of SOA and pitch in a more precise fashion, the present 
experiments examined laterality effects by varying the ear to which the targets 
were presented (which was fixed in Repp, 1975b). The vowel-masking paradigm is
interesting with respect to the ear advantages often found in dichotic
experiments: it lies somewhere between the familiar paradigm of competing CV syllables,
for which a right-ear advantage is usually obtained (for example,
Studdert-Kennedy and Shankweiler, 1970), and that of competing vowels, which typically
show no ear difference (e.g., Darwin, 1971). More specifically, we seem to be
dealing here with the perception of a single feature, voicing, which is
represented at the acoustic level by a temporal relationship, VOT. Likewise, the
competing information, the "pseudo-VOT" due to SOA, is a temporal relationship
(between the two ears, in this case). This kind of perceptual competition is a
novel situation, but a right-ear advantage may be expected, since there is some
evidence that the left hemisphere is superior for fine temporal discriminations
(e.g., Efron, 1963; Halperin, Nachshon, and Carmon, 1973), apart from its
specifically linguistic capacities.

EXPERIMENT I

Method



Subjects. Of the ten subjects who participated, six were paid volunteers
recruited through advertisements on the Yale University campus, and four were unpaid
volunteers from the staff of Haskins Laboratories. One of the latter was an
experienced listener; his results were included, since they did not differ
systematically from those of the other nine subjects, who had rarely or never
listened to synthetic speech before. There were seven females (hence, a subject
will be referred to as "she" in the following) and three males. One female was
left-handed. All subjects were native speakers of English and claimed to have
normal hearing.

The results of these ten subjects will be compared with those of a highly
experienced listener (BHR, the author), who repeated the experiment six times,
in order to assess the effects of experience and practice in this particular
task, and to produce stable data for a single listener. The author is
[illegible]-handed and a native speaker of the Austrian dialect of German. (While the
results for BHR will be reported below, his data are not shown or included in any
of the figures.)

Stimuli. The target stimuli were ten synthetic syllables taken from two
VOT continua, /be/-/pe/ and /de/-/te/. (For convenience, they will be referred
to as B-P and D-T continua.) The VOTs of the five syllables from each continuum
were 10, 20, 30, 40, and 50 msec, simulated by noise excitation and first-formant
cutback prior to voicing onset. All syllables were produced on the Haskins
Laboratories parallel formant synthesizer. Their duration was 300 msec, and
their fundamental frequency fell linearly from 130 to 90 Hz.

The masking vowels were three steady-state /e/ vowels of 240-msec duration, 
differing only in their pitch contours. The starting frequencies were 112 Hz
(low), 122 Hz (medium), and 136 Hz (high); they all fell linearly to 90 Hz at the
end of the vowel.

The dichotic tapes were constructed with the aid of the pulse-code-modulation
(PCM) system at Haskins. Each target syllable occurred in combination with
each of the three masking vowels at each of five SOAs: -10, 10, 30, 50, and 70
msec. (A negative SOA indicates that masking vowel onset preceded target
syllable onset.) Moreover, the target stimuli from the middle of the two continua
(VOT = 30), which were expected to fall closest to the category boundaries, were
replicated four times, the adjacent stimuli (VOTs = 20 and 40) twice, and the
endpoint stimuli (VOTs = 10 and 50) only once. Thus, the total stimulus set
consisted of 2 (target continua) x 10 (5 VOTs: 1 + 2 + 4 + 2 + 1 = 10 replications) x 3
(masking vowels) x 5 (SOAs) = 300 stimuli, which were completely randomized.
The interstimulus interval was 3 sec.
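
The size of the stimulus set can be checked by enumerating the design. The
following Python sketch is only an illustration of the counting argument above;
the variable names are hypothetical and the original tapes were of course not
generated this way.

```python
from itertools import product

# Design parameters as stated in the text (Experiment I).
continua = ["B-P", "D-T"]
replications_by_vot = {10: 1, 20: 2, 30: 4, 40: 2, 50: 1}  # VOT in msec -> copies
mask_pitches = ["low", "medium", "high"]
soas = [-10, 10, 30, 50, 70]  # msec; a negative SOA means the mask leads

# One entry per stimulus on the tape (before randomization).
stimuli = [
    (continuum, vot, pitch, soa)
    for continuum, (vot, n), pitch, soa in product(
        continua, replications_by_vot.items(), mask_pitches, soas
    )
    for _ in range(n)
]

print(len(stimuli))  # 2 x 10 x 3 x 5 = 300
```

Weighting the middle of each continuum in this way concentrates observations
on the stimuli expected to lie closest to the category boundaries.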

Procedure. The subjects were tested individually or in small groups in a
single session lasting about 90 minutes. The experimenter (BHR) usually joined
the subjects in listening. The stimulus tapes were played back from an Ampex
AG-500 tape recorder through a listening station interface to Grason-Stadler
(Telephonics) TDH-39 earphones. Output intensity was calibrated on a
Hewlett-Packard voltmeter. The average intensities (peak deflections) of the target
syllables and the masking vowels were equalized at approximately 73 dB SPL.
Usually, the left and right channels were interchanged by means of an electronic
switch, but in some cases (because of problems with the switch) by reversing the
earphones.

Each subject first listened monaurally to a random series of 150 syllables
and tried to identify each as "B", "P", "D", or "T" by writing down one response
for each syllable. Subsequently, the whole series of 300 dichotic stimuli was
presented, with the targets in the ear that had previously received the monaural
stimuli. The subjects were instructed to try to ignore the vowels and identify
the syllables as accurately as possible. After a break, the channels were
reversed and a second dichotic series was presented, in a different random order
and with the targets in the opposite ear. Finally, another list of 150 monaural
syllables was presented in the opposite ear. The sequence of (target) ears was
counterbalanced between subjects (and across sessions for BHR).

The purpose of the experiment was explained to the subjects after the 
session was completed. During the experiment, the subjects were not informed 
that VOT, SOA, and vowel pitch varied, and none of these variations became 
really obvious while listening, as gathered from the subjects' remarks and from 
the author's own impressions. When BHR listened, his attention was not drawn to 
any of these variables, and he never attempted to "listen for them" or compensate
for any of their known effects. Rather, he tended to perform the task passively 
and automatically, occupying his mind with other things while his hand wrote 
down the responses, in contrast to the naive subjects who concentrated hard. 

Results 

Monaural identification. The monaural identification functions for the ten
subjects are shown in Figure 1. It is evident that fewer voiced responses were
given to the stimuli from the B-P continuum than to the stimuli from the D-T
continuum, or, in other words, the dental boundary was farther to the right (by
about 6 msec) than the labial boundary, as expected (Lisker and Abramson, 1967).
However, two of the ten subjects showed no pronounced difference in their two
boundaries. Several subjects had difficulties in hearing "B"s, so that the
average B-P identification function did not reach asymptote at the shortest VOT
(10 msec).

Not surprisingly, BHR's identification functions were steeper than those of
the naive subjects, but they showed the same separation of the two continua.
His average category boundaries (the boundaries in individual sessions varied
over a range of 5 msec) were found to lie 3 msec farther to the right than the
average boundaries of the ten subjects, a highly significant difference in terms
of response distributions. This may be a reflection of his native language:
German voiceless stops tend to be more aspirated than their English counterparts,
and this may lead to a corresponding difference in the perceptual criteria for
voicelessness.
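
A category boundary in this sense is the VOT at which the percentage of voiced
responses (PVR) crosses 50. How the boundaries were computed is not stated here;
the Python sketch below assumes simple linear interpolation between adjacent
stimuli, and the percentages in it are invented for illustration only.

```python
def category_boundary(vots, pvr, criterion=50.0):
    """Return the VOT at which PVR falls through `criterion` (linear interpolation)."""
    points = list(zip(vots, pvr))
    for (v0, p0), (v1, p1) in zip(points, points[1:]):
        if p0 >= criterion >= p1 and p0 != p1:
            return v0 + (p0 - criterion) * (v1 - v0) / (p0 - p1)
    return None  # the function never crosses the criterion within this VOT range

vots = [10, 20, 30, 40, 50]                       # msec, as in Experiment I
hypothetical_pvr = [90.0, 80.0, 40.0, 10.0, 5.0]  # made-up data, not from the study
boundary = category_boundary(vots, hypothetical_pvr)
print(boundary)  # 27.5
```

A return value of None corresponds to a boundary that falls outside the range
of VOTs tested.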

Not shown in Figure 1 are the place errors, that is, the confusions between
the two continua. For BHR and several of the subjects, such confusions were
extremely rare. The great majority of place confusions was contributed by two
subjects who frequently substituted dental for labial consonants (one of them only
in the first identification series). Parenthetically, it is interesting to note
that one of these two listeners (who improved later in the session) showed the
labial category boundary in her dental confusion responses. The other subject
(who did not improve) seemed to follow the dental boundary in her substitutions.

Figure 1: Average percentages of voiced responses (PVR) to monaural stimuli
from labial and dental VOT continua in the context /-e/. The
prominence of the individual data points (open - filled - large filled) is
in proportion to the number of observations. The small triangles and
squares above and below the PVR = 0 level, respectively, represent the
distributions of the individual category boundaries (PVR = 50) on the
two continua. (The B-P boundary of one subject fell beyond the VOT
range shown and is not represented.)

The effect of SOA. The results of the dichotic condition are shown in
Figures 2 and 3 as percentages of voiced responses to the two continua. On each
continuum, the five target stimuli are plotted separately, but in the
statistical analysis they were treated together in a single weighted percentage score to
which an arcsine transformation was applied. Also, because of problems of
statistical power, the SOA factor was reduced to two levels by contrasting the two
shortest with the two longest SOAs, omitting SOA = 30. The resulting factors in
the four-way analysis of variance [Bock's (1975) pseudo-multivariate method for
repeated measurements] were: Ears (targets right vs. targets left), Place (B-P
vs. D-T), Pitch (of the vowel mask: low-medium-high), and SOA (-10 and 10 vs.
50 and 70). The between-subjects factor, "Order of Target Ears" (left-right vs.
right-left), was initially included but omitted after it was found to have had no
significant effects. In other words, the effects discussed here did not decrease
with practice. Likewise, no change over sessions was observed for BHR.
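
The scoring step can be sketched as follows. The exact form of the arcsine
transformation is not given in the text; the Python below assumes the common
variance-stabilizing form 2 arcsin(sqrt(p)), and the response counts are
hypothetical.

```python
import math

def weighted_pvr(voiced_counts, total_counts):
    """Pool responses over stimuli, so each stimulus is weighted by its number of trials."""
    return 100.0 * sum(voiced_counts) / sum(total_counts)

def arcsine_transform(percent):
    """Assumed form of the transform: 2 * asin(sqrt(p)) for proportion p, in radians."""
    return 2.0 * math.asin(math.sqrt(percent / 100.0))

# Hypothetical counts for the five VOTs in one cell of the design
# (trial numbers follow the 1 : 2 : 4 : 2 : 1 replication scheme).
voiced = [3, 5, 6, 1, 0]
totals = [3, 6, 12, 6, 3]

pvr = weighted_pvr(voiced, totals)
score = arcsine_transform(pvr)
print(pvr, score)
```

The transformation stretches the scale near 0 and 100 percent, where the
variance of a raw percentage is compressed.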

Figures 2 and 3 show the clear effect that SOA had on the percentages of
voiced responses (p < .0007). The results of BHR showed the effect to the same
extent (p < .0007). It was most pronounced for stimuli close to the category
boundaries. The overall pattern indicates that the relative sizes of the "posi- 
tive" and "negative" biases (increase vs. decrease in voiced responses) were 
dependent on the control score for a stimulus, that is, they were sensitive to 
the range of variation possible in either direction. In addition, the negative 
bias seemed to be somewhat more pronounced than the positive bias. 

The main effect of Place was significant (p < .04): more voiced responses
were given to the D-T continuum (cf. Figure 1). This was especially pronounced
for BHR (p < .006). The Place x SOA interaction was significant for BHR only
(p < .004): for him, the effect of SOA was stronger on B-P than on D-T, while the
ten subjects actually showed an opposite tendency (cf. Figures 2 and 3).



In the figures, it may be noted further that the "positive" effect on the
B-P continuum seemed to decrease between SOA = 10 and SOA = -10; this tendency was
less pronounced, or absent, on the D-T continuum. For BHR, the "negative" effect
also seemed to decrease between SOA = 50 and SOA = 70. In no case, however, did
either effect reach the (monaural control) asymptote within the range of SOAs
investigated (except for crossing it in the middle range). Stimuli from the
ends of the D-T continuum were virtually unaffected by SOA in BHR's data, and
Figure 3 shows a related tendency.

The effect of Pitch. The effect of vowel pitch is shown in Figure 4. The
curves have been averaged over all ten stimuli, since the Place x Pitch and
Place x Pitch x SOA interactions were nonsignificant. As can be seen, the
fundamental frequency at vowel onset exerted a strong effect in the expected
direction; that is, high pitch led to a relative decrease, and low pitch to a relative
increase in voiced responses. The main effect of Pitch was highly significant
(p < .0002), and so were both its component contrasts, which compared high and low
pitch, respectively, with medium pitch. The Pitch effect was equally pronounced
for BHR (p < .002). In general, high pitch was sufficient to eliminate the
positive bias at short SOAs, but low pitch did not eliminate the negative bias at
long SOAs.

Figure 2: Average percentages of voiced responses (PVR) to the five stimuli
from the B-P continuum, in the presence of vowel masks in the other
ear at five different SOAs. The prominence of the individual data
points and of their connecting lines is in proportion to the number
of observations. The monaural control PVRs (from Figure 1) are shown
at the right (C). Large symbols represent VOT = 30; small filled
symbols, VOT = 20 (top) and 40 (bottom); and small open symbols,
VOT = 10 (top) and 50 (bottom). The data are averaged over the Pitch
and Ears factors.

Figure 3: As Figure 2, for the D-T continuum.

Figure 4: Average percentages of voiced responses (PVR) at five SOAs in the
presence of dichotic vowel masks with three different fundamental
frequencies at onset (LO, MED, HI). The average monaural PVR is shown
here as a horizontal line (C). The data are weighted averages of all
stimuli from both continua.






From Figure 4, it is also clear that the Pitch functions began to converge at
SOA = 50 and that the effect suddenly disappeared completely at SOA = 70, although
a negative bias was still present at this SOA. The Pitch x SOA interaction was
highly significant (p < .0008) [for BHR, too (p < .007)]. The functions for BHR
were remarkably similar to those in Figure 4, except that his difference between
high and medium pitch disappeared by SOA = 30 and even tended to become inverted
at the two longest SOAs.

Ear differences. Ear differences were expected to be manifested as an Ears x
SOA interaction, such that the better ear showed both a smaller positive bias at
short SOAs and a smaller negative bias at long SOAs. As shown in Figure 5, there
was an indication of an average right-ear advantage (REA) but only at short SOAs,
particularly SOA = -10. The interaction did not reach significance (p > .24). On
the other hand, BHR showed a pronounced REA and a highly reliable interaction
(p < .0004). For him, too, the REA was especially pronounced at SOA = -10.

The REA also tended to interact with Pitch. The ten subjects did not show
any REA with high-pitch vowels, a tendency that approached significance (p < .07).
The author showed a significant Ears x Pitch x SOA interaction (p < .02), due
solely to the high-medium pitch contrast (p < .005): the REA was more pronounced at
medium pitch at short SOAs, but at high pitch at long SOAs. The Place x Ears
interaction also approached significance for BHR: his REA was more pronounced on
the B-P continuum (at short SOAs only) than on the D-T continuum. Note that this
last effect (more voiced responses to the left ear on the B-P continuum) is in
agreement with a similar tendency observed in the monaural task.

Place errors. The ten subjects committed 7.8 percent place errors, which
compares with 6.8 percent in monaural identification (or 3.8 percent, if only
the second monaural series is considered). This is a relatively slight increase,
and accuracy for place identification remained high. There were slightly more
place errors in the right ear than in the left ear, but closer inspection showed
an interaction with SOA: at the shortest SOA, there was an REA, while at the two
longest SOAs, there was a left-ear advantage. This trend was nearly significant
(χ²(9) = 15.3, .05 < p < .10). There were no other conspicuous patterns; note
especially that place errors appeared to be equally frequent at all SOAs.
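
The probability range quoted for the chi-square statistic can be checked
numerically. The sketch below is a stand-alone Python helper (not part of the
original analysis) that evaluates the upper-tail probability of a chi-square
variate from the power series for the regularized lower incomplete gamma
function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with `df` degrees of freedom.

    Uses the series expansion of the regularized lower incomplete gamma
    function P(df/2, x/2) and returns 1 - P.
    """
    a, h = df / 2.0, x / 2.0
    if h <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= h / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(h) - h - math.lgamma(a))
    return 1.0 - lower

p = chi2_sf(15.3, 9)
print(0.05 < p < 0.10)  # True
```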

Summary of results. (1) When the target syllables are taken from a VOT
continuum, isolated vowels in the other ear have a marked influence on the
percentage of voiced responses. This percentage increases at SOAs shorter than
about 30 msec and decreases at longer SOAs. The effect is most pronounced for
target syllables close to the category boundary. Both the positive and the
negative bias on voiced responses extend beyond the range of SOAs used here (-10
to 70 msec).

(2) The fundamental frequency at masking vowel onset systematically influ- 
ences the percentage of voiced responses, higher frequencies leading to fewer 
voiced responses and lower frequencies leading to more voiced responses. This 
effect disappears quite abruptly at SOA = 70.

(3) While a group of naive subjects showed no significant REA, the results
of BHR demonstrate that a REA may occur in this task. The REA tends to be most
pronounced when the masking vowel leads in time.

(4) Identification of the place feature is only little affected by the
interfering vowels, and the place errors follow no clear pattern.

(5) All these effects apparently do not change with practice and (with the
possible exception of the REA) are equally present in naive subjects and in an
experienced listener.

EXPERIMENT II 

Introduction 

The second experiment was conducted for several reasons. First, there was
the question whether the effects observed in Experiment I and in my previous
study (Repp, 1975b) are specific to masking vowels that are phonemically
identical with the vowel of the target syllables. To investigate this question, an
identical vowel mask was compared with a phonemically very different vowel mask.
Second, the range of SOAs was extended by 30 msec to both sides, in order to see
whether the effects of the masking vowel would reach their asymptote within this
range. Third, in an attempt to obtain more precise data in the region of the
category boundary, the SOAs were spaced more narrowly in this critical region.
The syllables were chosen from B-P and G-K continua, which are known to differ
even more in their VOT category boundaries than labial and dental continua
(Lisker and Abramson, 1967; Zlatin, 1974).

A fourth interesting aspect of the second experiment is its use of the
vowel context /-i/. Summerfield and Haggard (1974) have demonstrated that VOT
is of different salience as a voicing cue in /-a/ and /-i/ contexts. In /-a/
context, there is a substantial first-formant transition that, in conjunction
with a delay in voicing onset, may act as a voicing cue ("transition detection"
in the voiced part of the signal; cf. also Stevens and Klatt, 1974). To a lesser
degree, this applies also to the /-e/ context of Experiment I. In /-i/ context,
on the other hand, the first-formant transition is minimal or absent, and VOT is
likely to be the only cue to voicing (in the absence of a burst, as in the
present syllables). On these grounds, the biasing effect of the "pseudo-VOT" (SOA)
was expected to be even more pronounced in /-i/ context (targets of Experiment
II) than in /-e/ context (targets of Experiment I), since the transition of the
first formant was more pronounced in the latter than in the former.

Other changes with respect to the first study included slightly higher
intensities of both targets and masks, a short practice series at the beginning,
and identical flat pitch contours for targets and masks (which permitted binaural
fusion when the target and masking vowels were the same and overlapped). Pitch
was not varied in this study.

Method

Subjects. Eight paid volunteer subjects participated. Some had
participated in earlier experiments involving synthetic speech (but not in Experiment I),
but all were more or less naive listeners. There were five females and three
males. One male was left-handed. Again, BHR served as a comparison subject in
six replications of the experiment.

Stimuli. This time, the target syllables were ten synthetic syllables from
two VOT continua, /bi/-/pi/ and /gi/-/ki/. The VOTs of the five labials were
5, 15, 25, 35, and 45 msec, while those of the velars were 10 msec longer, viz.
15, 25, 35, 45, and 55 msec. The masking vowels were /i/ and /a/, each 250 msec
long. Fundamental frequency was constant at 110 Hz for both targets and masks.

There were eight SOAs: -40, -10, 10, 20, 30, 40, 60, and 90 msec for the
labial targets, and 10 msec longer for the velar targets, i.e., -30, 0, 20, 30,
40, 50, 70, and 100 msec. Target syllables from the middle of the continua were
replicated three times and adjacent syllables twice. Hence, there were 2
(target continua) x 9 (5 VOTs: 1 + 2 + 3 + 2 + 1 = 9 replications) x 2 (masking
vowels) x 8 (SOAs) = 288 stimuli.
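
The design of Experiment II can be enumerated in the same way as that of
Experiment I. The Python sketch below is illustrative only: the names are
hypothetical, and the quantity SOA minus VOT is an assumed way of expressing
the delay between voicing onset in the target and onset of the mask. It shows
that shifting both the VOTs and the SOAs of the velar continuum by 10 msec
leaves the set of these delays identical for the two continua.

```python
from itertools import product

replications = [1, 2, 3, 2, 1]  # endpoint, adjacent, middle, adjacent, endpoint
labial_vots = [5, 15, 25, 35, 45]                  # msec
labial_soas = [-40, -10, 10, 20, 30, 40, 60, 90]   # msec
shift = 10  # velar VOTs and SOAs are 10 msec longer

design = {
    "B-P": (labial_vots, labial_soas),
    "G-K": ([v + shift for v in labial_vots], [s + shift for s in labial_soas]),
}
masks = ["i", "a"]

stimuli = [
    (name, vot, mask, soa)
    for name, (vots, soas) in design.items()
    for (vot, n), mask, soa in product(zip(vots, replications), masks, soas)
    for _ in range(n)
]

# Assumption for illustration: delay of mask onset relative to target voicing onset.
asynchronies = {
    name: sorted({soa - vot for cont, vot, _, soa in stimuli if cont == name})
    for name in design
}

print(len(stimuli), asynchronies["B-P"] == asynchronies["G-K"])  # 288 True
```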

Procedure. Playback intensity was approximately 78 dB SPL for both targets
and masks. Each subject first received a random practice series of 40 binaural
syllables, which consisted of ten replications of each of the four "endpoint"
stimuli of the two continua, that is, supposedly good instances of /bi/, /pi/,
/gi/, and /ki/. The subject simply listened and compared the sounds with the
correct responses on a list she had in hand. Subsequently, 144 monaural
syllables were presented for identification, followed by the experimental series of
288 dichotic stimuli; and after a break the whole in reverse, with the target
syllables in the opposite ear, just as in Experiment I. (All other details of
method were the same as in Experiment I.)

Results 

Monaural identification. Figure 6 shows the monaural voicing scores for
the eight subjects. This time, the B-P continuum caused little trouble, but
most subjects had considerable difficulties with the G-K continuum. One subject
even gave inverted responses and tended to hear /g/ as /k/, and vice versa; her
data are not included in Figure 6. Most subjects improved in the course of the
experiment and did better on the second identification series (the functions in
Figure 6 are based on both series), but the voicing identification function for
G-K remained rather flat and did not reach asymptote at VOT = 55.

On the other hand, BHR gave equally consistent responses to both continua.
In his data, there was a clear separation of category boundaries (which was
indicated only at longer VOTs in the data of the eight subjects). This separation
was about 15 msec, as compared to 6 msec in Experiment I, which is the expected
outcome.

The eight subjects showed an average tendency to give more voiced responses
to the left ear, but only on the G-K continuum, and due to only four listeners.
However, BHR showed the same trend but much more consistently (χ²(1) = 19.9,
p < .0001) on the G-K continuum, and to a lesser degree, also on the B-P continuum
(as in Experiment I).

Place confusions were quite rare (2.2 percent) and occurred mostly on the
B-P continuum. Thus, with the exception of one subject (the one with the
inverted responses), the subjects' responses "stayed on the G-K continuum,"
despite the difficulties in voicing discrimination. BHR committed no place
errors at all. No other patterns were observed in monaural place confusions;
in particular, there was no ear difference.

Figure 6: Average percentages of voiced responses (PVR) to monaural stimuli
from labial and velar VOT continua in the context /-i/. The prominence of the
individual data points (open - filled - large filled) is in proportion to the
number of observations. The small triangles above the PVR = 0 level represent
the distribution of the individual B-P category boundaries (PVR = 50). (The
individual G-K boundaries are not shown because they were quite erratic.)

The effect of SOA. The results are shown in Figures 7 and 8. In the
statistical analysis, the eight SOAs were reduced to two levels by dividing the
SOA continuum into two halves and ignoring the 10-msec difference in SOAs for
the two continua. (Strictly speaking, the voicing onset asynchrony, which was
designed to be the same for the two continua, was substituted for SOA.)

As in Experiment I, SOA had the expected effect on the frequency of voiced 
responses. In fact, it was more dramatic than in Experiment I, as predicted, 
and was present even with the poor G-K continuum. The main effect of SOA (two 
levels) was highly significant (p < .001). The same held for BHR (p < .0001).
It interacted with the Place factor for the eight subjects only (p < .02),
reflecting the relatively weaker effect of SOA on the G-K continuum, a direct
consequence of the relatively poor voicing discrimination for these stimuli
(Figure 8). Only BHR, on the other hand, showed a significant Place main effect
(p < .002), due to a higher overall percentage of voiced responses on the G-K
continuum. The fact that this expected main effect was not significant for the
eight subjects is also in accord with the absence of a clear separation between
the two identification functions (Figure 6), while BHR showed a much more
pronounced separation.

Figures 7 and 8 show again that targets close to the category boundaries 
were affected most strongly, and that the relative extents of positive and nega- 
tive biases depended on the baseline level. These effects were especially clear 
for BHR, while the results of the eight subjects showed some susceptibility to 
interference in all stimuli. The figures further demonstrate that the range of 
the SOA effect has not yet been bracketed by using SOAs between -40 and +100
msec: neither all positive nor all negative effects have reached their asymp- 
totes at these SOAs. (See Figures 9 and 10 for better pictures of the average 
trends.) 

The nature of the vowel mask. Figure 9 compares the average results for the
two vowel masks. It is evident that /a/ did have a clear effect on voicing per- 
ception, but its effect was somewhat weaker than that of /i/. However, neither 
the Vowel main effect nor its interaction with SOA reached significance, although 
it is apparent that the difference between the two masks disappeared at the 
shortest and at the longest SOAs. (Both continua showed precisely the same 
pattern.) This interaction is not quite what would be expected if /a/ had had a 
truly weaker effect than /i/. On the other hand, BHR showed precisely the kind 
of interaction expected (p < .002), reflecting a genuinely attenuated but
nevertheless present biasing effect of /a/.

Ear difference. Ear differences are shown in Figure 10. The pattern is
strikingly similar to that in Experiment I (Figure 5). The eight subjects again
showed a REA only in the region of positive voicing bias, and there was no clear
interaction with SOA but instead a nonsignificant (p < .10) tendency to give
more voiced responses to the left ear. Closer inspection of the data showed that
the B-P continuum actually exhibited a weak interaction with SOA, while G-K
showed only a strong main effect throughout (cf. the corresponding monaural
tendency). The pattern in Figure 10 results from the averaging of these two
effects. However, the triple interaction, Place x Ears x SOA, did not reach
significance.

As in Experiment I, BHR showed the interaction expected from a true REA
(p < .008). There was an interaction with Place (p < .02), reflecting a strong
bias toward giving more voiced responses to the left ear on the G-K continuum
(cf. the monaural data). The REA of BHR in the positive bias region was mainly
due to G-K, while the REA in the negative bias region was due only to B-P. Given
the higher frequency of voiced responses to the left ear in the monaural task,
BHR's results may also be described as showing approximately equal positive
voicing bias in both ears, but a negative bias only in the left ear and not at
all in the right ear. His average REA was most pronounced at SOA = -40.

Place errors. The place confusions committed by the eight naive subjects
increased from 2.2 percent (monaural) to 3.3 percent (dichotic), again only a
small increase. However, the errors showed a definite pattern: they were much
more frequent with /i/ as mask than with /a/ (5.0 percent vs. 1.6 percent;
χ²(1) = 45.2, p < .0001). In other words, only the /i/ mask led to an increase
in place errors at all. Place confusions were also more frequent in the left ear
than in the right ear (4.0 percent vs. 2.6 percent; χ²(1) = 5.0, p < .03). No
clear pattern with respect to the target stimuli or SOA was present, except that
errors tended to be more frequent in the G and P regions. (A similar trend
existed monaurally.) BHR committed only five place errors (0.1 percent), which,
perhaps not by coincidence, all occurred in the left ear and with /i/ as mask.

Summary of results. (1) The basic effect of vowel masks on the perception
of voicing was replicated with a different set of stimuli in the context /-i/.
The effect was even more striking than in Experiment I and still exceeded the 
boundaries of the SOA range (-40 to 100 msec). 

(2) The effects of a vowel mask identical with the vowel following the 
target consonant (/i/) and of a very different vowel mask (/a/) were qualitative- 
ly similar, but the effects of /a/ tended to be weaker. 

(3) Again, BHR showed a clear REA, while the group of eight subjects showed 
only a tendency in that direction. Also, the REA was again most pronounced at 
negative SOAs. 

(4) Place confusions were increased only with the /i/ mask and tended to
exhibit a REA.

DISCUSSION 

The Nature of the "VOT Masking Effect" 

The present experiments confirm and extend the observations of Repp (1975b)
by demonstrating a clear biasing effect of isolated vowel masks on the
perception of the voicing feature of a stop-consonant target in the other ear.
This effect is very different from the kind of interference observed in typical
dichotic listening tasks, where two equivalent stimuli (e.g., CV syllables)
compete for a single processor or response mechanism (Pisoni and McNabb, 1974;
Repp, 1974a). Given consonants (in CV syllables) as targets, isolated vowel
masks contain no competing auditory or phonetic information that could be
mistakenly accepted as a consonantal feature. Rather, it is the temporal
relationship between target and mask that itself provides competing
information. Thus, in contrast to a standard masking paradigm, the competing
information changes with SOA. Of course, the relevant aspect of the target, the
voicing feature, is also defined primarily as a temporal relationship, VOT,
which was the only voicing cue in the synthetic syllables used here. In a way,
then, a temporal relationship is being "masked" by another temporal
relationship in the present paradigm.

This explanation in terms of derived auditory or phonetic parameters implies
that the interaction between the competing information takes place after
temporal relationships have been extracted from (and from between) the auditory
signals, that is, at a higher auditory or phonetic level. This is in agreement
with a general explanation of dichotic interference in terms of interactions of
derived stimulus features (e.g., Blumstein, 1974). It is conceivable, however,
that similar effects arise from the combination of more elementary auditory
information from the two ears. The two stimuli could fuse with each other and be
acoustically integrated before temporal parameters are extracted from this
combined signal, which would represent a complex case of simultaneous "central"
auditory masking (Repp, 1975b). There are indications, however, that a
higher-level interaction is involved. One hint is that the perception of the
place feature is so little affected. If the masking vowel were truly
superimposed on the target stimulus, it should have a stronger simultaneous
masking effect on the transition portion of the target syllable whenever the two
overlap in time. However, place errors showed no clear trends with respect to
SOA. A second argument is the relatively wide temporal range of the voicing
bias, which is estimated to reach from about -70 to 120 msec SOA. This range by
itself points toward a higher level of interaction. Perhaps the strongest
argument concerns the reduced effects of vowel masks on target stimuli that lie
farther away from the VOT category boundaries. This is best illustrated by
considering two simple models of the "VOT masking effect."

Assume that, in each target-mask combination, there are two competing
voicing cues: the actual VOT of the target syllable, and the "pseudo-VOT" due to
the SOA between target and mask onsets. If these two cues were simply weighted
and combined (or, in terms of a discrete model, if one were substituted for the
other in a certain percentage of cases), the resulting average percept should
lie somewhere between the two extremes corresponding to the two cues. This can
be expressed as

    p(V+ | VOT, SOA) = (1 - c_SOA) p(V+ | VOT) + c_SOA p(V+ | SOA)        (1)

which is to be read as follows: the probability of a voiced response to a
target with a given VOT at a given target-mask SOA, p(V+ | VOT, SOA), is a
weighted combination of the probability of a voiced response for this target in
isolation, p(V+ | VOT), and the probability of a voiced response given only the
SOA ("pseudo-VOT") cue, p(V+ | SOA). The c_SOA factor is a measure of the
relative influence of the SOA cue; it is 0 when there is no influence and 1 when
p(V+ | VOT, SOA) depends on SOA only. The reason c_SOA depends on SOA is that
the influence of the vowel masks clearly diminishes as SOA increases in either
direction.
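
The weighted-combination model of Equation (1) can be made concrete with a
minimal sketch. The function name and the numerical values below are
hypothetical and chosen only for illustration; they are not data from the
experiments.

```python
def combined_voiced_probability(p_vot: float, p_soa: float, c_soa: float) -> float:
    """Equation (1): weighted mix of the two cue probabilities.

    p_vot -- p(V+ | VOT), voiced-response probability for the target alone
    p_soa -- p(V+ | SOA), voiced-response probability from the pseudo-VOT cue
    c_soa -- relative influence of the SOA cue (0 = none, 1 = SOA only)
    """
    if not 0.0 <= c_soa <= 1.0:
        raise ValueError("c_soa must lie between 0 and 1")
    return (1.0 - c_soa) * p_vot + c_soa * p_soa


# Illustrative (made-up) values: a target usually heard as voiced, paired with
# an SOA cue that favors a voiceless percept and carries a weight of 0.25.
example = combined_voiced_probability(p_vot=0.9, p_soa=0.2, c_soa=0.25)
```

With c_soa = 0 the function returns p_vot unchanged, and with c_soa = 1 it
returns p_soa, which matches the two limiting cases described in the text.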

One can easily estimate p(V+ | VOT) by the identification function for
isolated syllables. On the other hand, p(V+ | SOA) is unknown and probably could
not be measured by itself at all. However, if SOA acted indeed like a
"pseudo-VOT," then p(V+ | SOA) should follow a function similar to p(V+ | VOT),
with the "category boundary" lying around the SOA that equals the VOT at the
category boundary. However, while the VOT category boundary depends on place of
articulation, the SOA category boundary may well be independent of this
within-syllable factor.

The model in Equation (1) can easily be tested. It states that, at any
given SOA, p(V+ | VOT, SOA) should be a linear function of p(V+ | VOT), with the
slope (1 - c_SOA) and the intercept c_SOA p(V+ | SOA). Plotting
p(V+ | VOT, SOA) as a function of p(V+ | VOT) at different SOAs is an
instructive alternative way of summarizing the data. These functions are shown
in Figures 11 and 12 for the two experiments, respectively.

Figure 11: p(V+ | VOT, SOA) as a function of p(V+ | VOT); see text for
explanation. Hand-fitted curves for data of Experiment I. The small
numbers indicate the SOA represented by each curve.

Figure 11 is derived from Figures 2 and 3. The curves were fitted by hand to
the 10 points corresponding to the 10 target stimuli. Apparently, both B-P and
D-T stimuli were fitted well by the same function, as predicted. In other words,
the effect of SOA seems to depend only on p(V+ | VOT) but not on place of
articulation (as far as this rough analysis goes). Figure 12 represents the B-P
stimuli from Figure 7. The G-K stimuli were not fitted well by these functions
and were omitted; their erratic pattern was probably due to the perceptual
difficulties the subjects had with them.
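
The linearity prediction of Equation (1) can be checked numerically as well as
by eye. The following is only a sketch under stated assumptions: the helper
names are hypothetical, the data are synthetic (generated from Equation (1)
itself, not taken from the figures), and an ordinary least-squares line stands
in for the hand-fitting used in the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def max_abs_residual(xs, ys, slope, intercept):
    """Largest deviation of the data from the fitted line."""
    return max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))


def implied_parameters(slope, intercept):
    """Recover c_SOA and p(V+ | SOA) from slope = 1 - c and intercept = c * p."""
    c_soa = 1.0 - slope
    if c_soa == 0.0:
        raise ValueError("slope of 1 means no SOA influence; p(V+ | SOA) is undefined")
    return c_soa, intercept / c_soa


# Synthetic data generated with c_SOA = 0.4 and p(V+ | SOA) = 0.25 (made up).
p_vot = [0.0, 0.25, 0.5, 0.75, 1.0]
p_combined = [0.6 * x + 0.1 for x in p_vot]
slope, intercept = fit_line(p_vot, p_combined)
c_soa, p_soa = implied_parameters(slope, intercept)
```

A large value from max_abs_residual relative to the scatter in the data would
indicate the kind of nonlinearity that argues against the simple model.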

The functions for the two experiments are essentially similar, and all are 
grossly nonlinear. (BHR showed very similar patterns.) The simple model of 
Equation (1) must therefore be rejected. The relationship between p(V+ | VOT, 
SOA) and p(V+ | VOT) can be described in terms of a family of curvilinear
functions that is characterized by two parameters: (1) the amount of (maximal)
deviation from the y=x (no effect) diagonal, which represents the degree of
influence of SOA; and (2) the point at which the function crosses the diagonal,
that is, the p(V+ | VOT) at which a particular SOA has no effect. (This point is
0 if every SOA exerts a negative bias, and 1 if every SOA exerts a positive
bias.) The precise mathematical representation of these functions still needs
to be derived.

Note that it is not true that SOA has no effect when SOA = VOT, as one might
expect intuitively (cf. Figures 2, 3, 7, and 8). Rather, the "neutral" point
depends on SOA and on p(V+ | VOT) but not directly on VOT. It is then
straightforward to make the assumption that p(V+ | VOT, SOA) = p(V+ | VOT)
precisely when p(V+ | VOT) = p(V+ | SOA); that is, that SOA will have no biasing
effect when the probabilities of a voiced response on the basis of each cue are
equal [cf. Equation (1)]. Given this assumption, the hypothetical p(V+ | SOA)
functions can be derived from Figures 11 and 12, and they are plotted in Figure
13. Despite this very rough estimation procedure, the functions derived from the
two experiments are remarkably similar, as they should be, although the average
SOA effects were more pronounced in Experiment II. Both functions resemble VOT
identification functions (cf. Figures 1 and 6, the better stimuli) and have
category boundaries [where p(V+ | SOA) = .5] at 24- and 28-msec SOA,
respectively. These values are in the same range as the VOT boundaries for the
stimuli used here. The separation of the two functions in Figure 13 may be due
to various factors (vowel contexts, intensities, pitch contours, etc., or merely
imprecision) and deserves no further comment at this stage. (For the sake of
fairness, it should be mentioned that the corresponding functions for BHR were
not nearly as neat and had atypical category boundaries. Therefore, some
scepticism about this analysis remains.)
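
The estimation step just described amounts to locating, for each SOA, the point
where the curve of p(V+ | VOT, SOA) against p(V+ | VOT) crosses the diagonal.
The sketch below shows one way this could be done by linear interpolation
between plotted points; the function name and the sample values are
hypothetical, and the paper's own estimates were read from hand-fitted curves.

```python
def neutral_point(p_vot, p_combined):
    """Estimate the p(V+ | VOT) at which a given SOA has no biasing effect.

    p_vot      -- p(V+ | VOT) values in increasing order
    p_combined -- matching p(V+ | VOT, SOA) values for one SOA
    Returns the diagonal crossing, 0.0 if the bias is negative everywhere,
    1.0 if it is positive everywhere, or None if no estimate is possible.
    """
    diffs = [y - x for x, y in zip(p_vot, p_combined)]
    for i in range(len(diffs) - 1):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 == 0 and i > 0:
            return p_vot[i]  # an interior point lies exactly on the diagonal
        if d0 * d1 < 0:
            # Sign change: interpolate linearly to the zero of (y - x).
            t = d0 / (d0 - d1)
            return p_vot[i] + t * (p_vot[i + 1] - p_vot[i])
    if all(d < 0 for d in diffs):
        return 0.0  # every stimulus shifted toward "voiceless"
    if all(d > 0 for d in diffs):
        return 1.0  # every stimulus shifted toward "voiced"
    return None


# Made-up curve: positive bias for low p(V+ | VOT), negative bias for high.
crossing = neutral_point([0.1, 0.5, 0.9], [0.3, 0.6, 0.7])
```

Under the assumption stated in the text, the returned value is taken as the
estimate of p(V+ | SOA) for that SOA.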

Figure 13: p(V+ | SOA) as estimated from Figures 11 and 12; see text for
explanation.

Clearly, SOA has less of an effect on stimuli remote from the VOT category
boundary [whose p(V+ | VOT) is close to 0 or 1] than on stimuli close to the
boundary. One reason may be that the effect comes about through "direct" masking
of the voicing onset in the target by the (voicing) onset of the mask, which
might be a relatively peripheral process, as suggested by Repp (1975b). In this
case, the voicing onset asynchrony (VOA) would determine the extent of the bias.
The model of Equation (1) may then be reformulated:

    p(V+ | VOT, SOA) = (1 - c_VOA) p(V+ | VOT) + c_VOA p(V+ | SOA)        (2)

The weighting factor now depends on VOA = SOA - VOT. This model can be tested if
p(V+ | SOA) is estimated by the functions in Figure 13. However, an attempt to
fit this model failed: the fit was excellent in the middle range of VOT, but the
effect of SOA on stimuli with extreme p(V+ | VOT) was grossly overestimated.
Hence, Equation (2) must be rejected, too.
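
For comparison with Equation (1), the reformulated model can be sketched with
the weight written as a function of the voicing onset asynchrony. The text does
not specify the form of this dependence, so the bell-shaped weight below and
its parameters (c_max, width_ms) are purely hypothetical placeholders, as are
the function names.

```python
import math


def voicing_onset_asynchrony(soa_ms: float, vot_ms: float) -> float:
    """VOA = SOA - VOT, as defined in the text."""
    return soa_ms - vot_ms


def weight_from_voa(voa_ms: float, c_max: float = 0.8, width_ms: float = 60.0) -> float:
    """Hypothetical weight: largest when the two voicing onsets coincide."""
    return c_max * math.exp(-((voa_ms / width_ms) ** 2))


def combined_voiced_probability_voa(p_vot, p_soa, soa_ms, vot_ms, weight=weight_from_voa):
    """Equation (2): same mixture as Equation (1), but weighted by c_VOA."""
    c_voa = weight(voicing_onset_asynchrony(soa_ms, vot_ms))
    return (1.0 - c_voa) * p_vot + c_voa * p_soa


# Made-up values: onsets coincide (VOA = 0), so the weight equals c_max.
coincident = combined_voiced_probability_voa(0.9, 0.1, soa_ms=30.0, vot_ms=30.0)
# Very large asynchrony: the weight vanishes and the target percept is unchanged.
remote = combined_voiced_probability_voa(0.9, 0.1, soa_ms=1000.0, vot_ms=0.0)
```

Any other weight function can be passed through the weight parameter, which
keeps the sketch independent of the particular (assumed) shape used here.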

These results confirm in a somewhat more precise fashion what Figures 2, 3,
and 7 seemed to show: the effect of SOA depends on p(V+ | VOT). Stimuli that
have a low a priori "voicing uncertainty" are less affected than stimuli with
high voicing uncertainty, even taking into account factors such as VOA. This
seems to provide a fairly strong argument that the effect takes place at a
relatively late stage in processing, viz. at the linguistic level where voicing
decisions are made. It seems that an attempt to categorize (the voicing
dimension of) the target stimulus precedes the influence of SOA, and that the
degree of this influence then depends on whether the categorization has had an
ambiguous outcome or not. The complexities of deriving a formal model for this
case will be avoided here, especially since inspection of Figures 11 and 12
indicates that maximal SOA effects do not always occur with stimuli of maximal
voicing uncertainty [p(V+ | VOT) = .5] but depend in a more complex fashion on
both uncertainty and the "room for variation" in a certain direction.

It is interesting to note that the negative bias tended to be stronger and
had a wider temporal range than the positive bias. This difference is analogous
to the "lag effect" observed in traditional dichotic paradigms (Studdert-Kennedy,
Shankweiler, and Schulman, 1970; Repp, 1975a). Apparently, at the level of com-
peting temporal relationships, too, the lagging information tends to displace 
the leading information more often than vice versa. This is another argument 
supporting the conclusion that the vowel "masking" effect arises at a relatively 
late stage in processing. 

Secondary Effects 

The effects of SOA were more pronounced in Experiment II than in Experiment
I, especially on the B-P continuum. Although there were several other
differences between the two experiments (for example, intensity levels, pitch
contours, subjects), the most appealing explanation is with regard to the vowel
context, as outlined earlier. In the /-e/ context, the subjects presumably could
rely on an additional cue to voicing (the first-formant transition), besides the
temporal aspect of VOT, and therefore were somewhat less susceptible to the
effects of SOA, especially since the masking vowels hardly seemed to affect
transitional information (as shown by the small increase in place errors).

Experiment II further demonstrated that the biasing effect is obtained with
a masking vowel (/a/) that is very different from the vocalic context of the
target (/-i/). Since the effect of /a/ was reduced relative to that of /i/, it
seems that the "perceptual separation" of the dichotic stimuli influences the
extent of "masking" to some degree. Especially with the flat pitch contours used
in Experiment II, identical vowels in the two ears tended to fuse when they
overlapped, resulting in a partially fused percept. This did not occur with a
very different masking vowel, so that it could be more easily ignored. However,
since a biasing effect was obtained even with /a/ masks, the most important
factor seems to be the onset of energy in the region of the fundamental
frequency. It is interesting to note, in addition, that the /i/ mask increased
place errors while the /a/ mask did not. For the perception of transitions in
the target, the formant structure of the mask may well be more relevant than for
the perception of VOT.

The effects of fundamental frequency variation were investigated in
Experiment I. Three vowel masks that all terminated at the same pitch had
strikingly different effects, depending on their starting frequency. The lower
the fundamental, the stronger the tendency to give voiced responses. Clearly,
this effect is related to "pitch as a voicing cue," as described by Haggard,
Ambler, and Callow (1969). However, the stimuli employed by Haggard and his
colleagues used much more extreme "pitch skips" within synthetic syllables,
which occurred during the transitions, while the present differences were
smaller and, of course, occurred in the opposite ear. Perhaps more comparable is
the finding (House and Fairbanks, 1953) that, in natural speech, fundamental
frequency tends to be higher following voiceless stops than following voiced
stops, and that this difference extends into the following vowel. The rapid
convergence of the pitch effect at SOA = 70 in Experiment I is interesting; it
would be challenging to search for evidence that pitch differences tend to
extend for a comparable time into the steady-state vowel in natural speech.
Haggard's estimate¹ of about three pitch periods (about 30 msec) is precisely of
the right magnitude. In any case, the pitch effect may be added to the growing
list of examples that speech perception takes into account the constraints and
dynamics of speech production.

The final issue to be discussed is the right-ear advantage. For the naive
subjects, it did not reach significance in either experiment, but there was a
tendency that was strikingly similar in both studies in that it was strongest at
negative SOAs. The author, who is known to exhibit a pronounced REA in a
standard dichotic listening test, showed highly significant REAs in both
experiments. He also showed the strongest REA at negative SOAs. His results are
accepted as sufficient evidence that individual ear advantages may manifest
themselves in this task, although the data of the naive subjects suggest that
the REA is less pronounced than in dichotic tests employing competing CV
syllables. Following current theories of lateralization, the REA may be assumed
to be due to transcallosal degradation of left-ear stimuli in the presence of
competing right-ear input (cf. Studdert-Kennedy, 1975). Apparently, an isolated
vowel mask is sufficient to provide the competition necessary to inhibit the
ipsilateral auditory pathways, although it does not compete directly at the
phonetic level. This is reasonable, since this kind of inhibition is certainly
governed by gross auditory characteristics. Transcallosal degradation may lead
to less accurate identification of the voicing cues in the target. In addition,
the left hemisphere may be more precise in assessing brief time intervals such
as VOT; but because of their linguistic nature, left-hemisphere processing of
the target syllables is highly likely. "Pseudo-VOT," on the other hand, is a
relation between the two ears, and its perception should not be subject to
laterality differences. The pitch at masking vowel onset is a property that need
not necessarily lead to a REA; in fact, prosodic variables have often been
associated with a left-ear superiority. It is interesting to note that neither
BHR nor the subjects showed any tendency toward an ear difference for the pitch
effect in Experiment I. Only the effect of SOA was associated with a REA. Why
this REA was most pronounced at negative SOAs is not quite clear, but the
phenomenon cannot be ignored; it was too consistent. Perhaps ipsilateral
inhibition is maximal when there is complete overlap of the dichotic stimuli; or
inhibition takes time to develop and therefore is maximal when mask onset
precedes target onset. However, there is no corresponding phenomenon in
conventional dichotic listening experiments.



¹Mark Haggard, 1975: personal communication.

While the REA in voicing perception did not reach significance for naive
subjects, they showed a significant REA in place perception in Experiment II and
at the shortest SOA in Experiment I. (BHR made too few errors to show any
significant ear differences.) This is in line with the classical REA in dichotic
listening. It is interesting that the weak interference provided by the vowel
was sufficient to reveal a REA. This REA may be a lower-level component of the
REA obtained when the mask is a CV syllable, since, in the present case, the
competing phonetic information on the place dimension is eliminated (and with it
a corresponding part of auditory competition, otherwise provided by the
transitions of the mask). This lower-level component may represent transcallosal
degradation and/or auditory interference only, while competing CVs may add an
additional REA owing to phonetic competition and transition-specific auditory
competition. Such an interpretation fits well the increasing awareness that ear
advantages may be composed of processing superiorities at several levels (Porter
and Berlin, 1975).
The ear differences exhibited in the monaural VOT boundaries by BHR remain 
a curious and puzzling effect. It deserves some further investigation, since it 
suggests the possibility that monaural ear differences may exist in the
perception of temporal stimulus properties.



The Magical Number Two and the Natural Categories of Speech and Music*



James E. Cutting+



ABSTRACT

While upper limits of information processing capture the interests
of most experimental psychologists, certain lower limits entice
those interested in speech perception. Thus, the magical number for
speech is not seven but two, manifested most clearly in the
phenomenon of categorical perception. Small deviations from twoness are
seen in the perception of stop consonants, whereas considerably
larger deviations are seen for vowels. Recently, stop-consonant-like
results have been obtained for musical sounds differing in rise time,
and identified as pluck and bow. Like the categories for stop
consonants, those for pluck and bow appear to be natural and not learned:
infants as young as two months discriminate the musical sounds in a
manner functionally identical to adults. Mechanisms for the
perception of both speech and certain nonspeech sounds appear to be
opponent-process feature analyzers not under the conscious control of the
perceiver.

Eleanor Rosch (1973) found that the Dani, a nonindustrial and nonliterate
community in New Guinea, perceive certain colors and shapes in a manner
functionally identical to American college sophomores. Her result is interesting
because the Dani have no color terms other than those for light and dark and no
terms for angular geometric figures. Her methodology is complex and not
relevant to speech research; her discussion centers more on the general area of
cognition than on perception, but her conclusion is central to my theme: there
are salient stimuli in our environment that we perceive as prototypes of natural
categories. In other words, our perceptual apparatus is geared to perceive
certain stimuli better than others, and it warps a somewhat ill-fitting stimulus
to be more like its natural prototype. Moreover, going somewhat beyond Rosch,
there are distinct perceptual boundaries between these adjacent categories. The
categories and boundaries are "natural" because they remain largely unmodified
by learning or by environment. Rosch presented convincing evidence that natural
categories exist in vision; here, I hope to demonstrate that they are prevalent
in audition and are accompanied by equally "natural" boundaries. I will use
findings of speech research to establish particular patterns of results
indicative of "categorical" perception, and then search for them in music as
well. Before presenting any data, however, I will discuss a theoretical
framework in which to consider categories and boundaries.

*To appear in Tutorial Essays in Psychology, ed. by N. S. Sutherland (Hillsdale,
N. J.: Lawrence Erlbaum Assoc., in press).



For the last two decades a certain segment of the psychological community
has been persecuted by an integer. The persistence with which this number
plagues those of us interested in speech perception is far more than a random
accident. There is, to quote a famous senator (and perhaps a more famous
psychologist), a design behind it, some pattern governing its appearance. Either
there really is something profound about this number or else we are all
suffering from delusions of persecution. Our number, however, is not seven; it
is two.

It is no mere trick that I choose to paraphrase the first paragraph of
George Miller's famous paper from Psychological Review (1956). Information
processing has certainly burgeoned in the twenty years since his paper appeared,
and talk of channel capacities and bits of information has since filled many
books and articles. One may worry, then, that those of us interested in this
smaller integer are somewhat misguided, if not stunted: perhaps each of us is
only two-sevenths of a proper psychologist, or perhaps our student subjects are
only two-sevenths as bright as most. This is not the case (we hope). Whereas
Miller is concerned with an upper limit of perceptual processing, we are
interested in a lower limit. In addition, we are interested in the possible
benefits derived from binary systems. In information-theory terms, Miller is a
three-bit researcher; we, on the other hand, are not even two-bit but rather
one-bit researchers.
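
The bit counts in this comparison follow from the usual information-theory
convention that choosing among n equally likely alternatives conveys log2(n)
bits. The short sketch below (the function name is a hypothetical helper, not
anything from the paper) shows the arithmetic: seven categories come to just
under three bits, and two categories to exactly one.

```python
import math


def bits_for_alternatives(n: int) -> float:
    """Information, in bits, carried by one choice among n equally likely alternatives."""
    if n < 1:
        raise ValueError("need at least one alternative")
    return math.log2(n)


seven_categories = bits_for_alternatives(7)  # a little under three bits
two_categories = bits_for_alternatives(2)    # exactly one bit
```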

Psychologists and others have come very late to one-bit research,
especially as it is relevant to language. Millennia before engineers and their
computer science stepchildren thought in terms of binary electrical circuits,
before physiologists discovered all-or-none neural firings, and before
geneticists postulated dominant and recessive genes, Greek and Sanskrit
grammarians were discovering the magical number two in distinctive features.
These binary systems are fundamental to language: "the dichotomous scale is the
pivotal principle of the linguistic structure" (Jakobson, Fant, and Halle,
1951:9). Spoken language, in particular, is a house built on the number two (see
also Lane, 1967).

Consider some important binary oppositions in speech, using /ba/, as in
bottle, as a reference syllable. Much as a dollar sign denotes that numbers are
American money, the slashes here indicate that the letters between them are
spoken according to the International Phonetic Alphabet. It is reasonable that
/ba/ should be considered a central utterance in a scheme of speech tokens.
Unlike many speech sounds, the elements /b/ and /a/, and the syllable itself,
are nearly universal to all languages of the world. A related syllable, /pa/, as
in pod, is also nearly universal. Together, the two consonants /b/ and /p/ are a
voiced-voiceless pair and differ only in the relative timing of the opening of
the mouth and the initiation of pulsing in the larynx. For /ba/ the timing is
nearly simultaneous in English, whereas for /pa/ there is a slight delay in the
onset of voicing, which is preceded by about a twentieth of a second of whisper.
This distinction is important, because there is no speech sound, or phoneme,
that is intermediate between /b/ and /p/.



The Magical Number Two in Speech Sounds







Another binary pair is /ba/ and /ma/, which differ in manner of production;
/ma/ is nasalized, /ba/ is not, but otherwise the two are identical speech
sounds. When a child says "I have a cold id by doze," we can appreciate the
effect of clogged nasal passages on the neutralization of this phonetic
distinction. A third pair is /ba/ and /da/, which differ in place of
articulation: /ba/ is labial, produced at the lips, and /da/ is alveolar in
English, produced by placing the tongue on the alveolar ridge behind the teeth.
Just as there is no speech sound between /ba/ and /pa/, there are none between
/ba/ and /ma/ and none between /ba/ and /da/.

Until World War II these distinctions were based on little more than three
thousand years of intuition about the nature of speech production.
Psychologists, wary if not skeptical of intuition and typically more interested
in perception than production, did not become interested in speech until the
invention of the sound spectrograph. This device transforms sound into a
permanent visual record of time, frequency, and intensity patterns. [See Potter,
Kopp, and Green (1947), for elegant and detailed examples of sound
spectrograms.] Shortly after the invention of this auditory-to-visual transform
came its inverse, a device known as the pattern playback, which transforms a
visual display into sound. Through a period of interactive experimentation with
these two devices, many of the important acoustic cues were discovered that
separate speech sounds from one another (see Liberman, Cooper, Shankweiler, and
Studdert-Kennedy, 1967, for an overview). Schematic spectrograms of the four
syllables of particular interest here are shown in Figure 1. Since the three
pairs are logically orthogonal, they are displayed as if in three-dimensional
space. These examples are exactly like those used for the pattern playback, and
would be highly intelligible (if
somewhat metallic and "unnatural" sounding) when played through that device.

Figure 1: Schematic spectrograms of /ba/ (as in bottle) and three other
syllables whose initial consonants differ from /b/ along one phonetic
feature.

Observe the acoustic differences between the syllable pairs. Although all
pairs are very similar, /ba/ and /pa/, for example, differ in two ways. In /pa/
the first formant, the dark resonance band of lowest frequency, has been cut
back from stimulus onset by about 60 msec. Also, the excitation pattern of the
second, or higher, formant has changed. Instead of being excited by a periodic
glottal source in the larynx, it is excited by aperiodic or noiselike
turbulences in the mouth cavity. Natural speech tokens would typically have a
third and other higher formants. Whereas the third formant carries some
important phonetic information, the fourth and higher formants carry little or
none; the first two carry the bulk of the linguistic information load and
suffice for these syllables. The syllable pair /ba/ and /ma/ differ mostly in
the addition of steady-state nasal resonances to /ma/. They extend from just
before to just after the release of constriction at the lips (which creates the
formant transitions). For /ma/, however, the first-formant transition is less
prominent. The differences between /ba/ and /da/ are perhaps the smallest and
conceptually easiest to visualize of the three pairs. In /ba/ the second formant
glides upward in frequency at syllable onset, whereas in /da/ the second formant
glides downward by about the same amount. It will be instructive to consider
this pair in more detail.

Identifying. Humans have little success in producing speech sounds
intermediate between /ba/ and /da/. Computer-driven speech synthesizers, on the
other hand, can easily be programmed to produce these unlikely sounds. When a
seven-item continuum of utterances is generated from /ba/ to /da/, the syllables
array themselves as shown in the left panel of Figure 2. When these seven
syllables are randomly ordered and presented many times, and when listeners
identify each as either /ba/ or /da/, we find our first empirical manifestation
of the magical number two. Complementary identification functions show discrete
perceptual categories, as seen in the upper-left panel of Figure 3. These are
actual, not idealized, data. Notice that the first three stimuli in that array
are almost always identified as /ba/, and that the last three items are almost
always identified as /da/. (Stimulus 4 is perceived as /ba/ about half the time
and /da/ the other half.) The stimulus differences appear to be perceived in a
discrete rather than continuous manner.
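
The shape of such complementary identification functions, and the location of the crossover between them, can be sketched in a few lines of Python. This is only an illustration: the percentages below are hypothetical values chosen to match the verbal description (Stimuli 1 to 3 nearly always /ba/, Stimulus 4 split evenly, Stimuli 5 to 7 nearly always /da/), not the data plotted in Figure 3, and the helper name `crossover` is an assumption of the sketch.

```python
# Hypothetical identification percentages for Stimuli 1..7
# (illustrative values only, NOT the data from Figure 3).
percent_ba = [100, 99, 97, 50, 3, 1, 0]
percent_da = [100 - p for p in percent_ba]  # complementary function


def crossover(percent):
    """Return the 1-based (possibly fractional) stimulus position at which
    a falling identification function reaches 50 percent."""
    for i in range(len(percent) - 1):
        upper, lower = percent[i], percent[i + 1]
        if upper >= 50 > lower:
            # Linear interpolation between stimulus i+1 and stimulus i+2.
            return (i + 1) + (upper - 50) / (upper - lower)
    return None  # the function never crosses 50 percent


boundary = crossover(percent_ba)
print("category boundary near Stimulus", boundary)
```

With these made-up values the boundary falls at Stimulus 4, the one item that listeners label inconsistently.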

However, one should not be overly impressed with the quantal nature of
these complementary functions. Imagine an array of lines tilted at various
angles like that shown on the right of Figure 2. If we "read" these lines from
left to right, Stimuli 1 through 3 might be considered "ascending" and Stimuli 5
to 7 "descending." Increments of physical difference between members of this
visual array are exactly equal in angular degrees, just as increments in the
/ba/-to-/da/ auditory series are equal in slope change of the second-formant
transition. When the visual stimuli are mounted on cards and viewers are asked
to classify each as ascending or descending, we find nicely quantized
identification functions shown in the upper-right panel of Figure 3, with only
Stimulus 4, the true horizontal, not a member of either category. Clearly, the
auditory and visual results are similar, and nothing would appear to be peculiar
about speech.

As a further demonstration that identification functions should not be
overemphasized, consider what happens when we ask the same listener/viewers to
classify the continua into three categories instead of two. The speech-syllable
choices here are /ba/, "ambiguous" (not convincing as either stop consonant),
and /da/; and the slanted-line categories are ascending, horizontal, and
descending. Results are shown in the lower panels of Figure 3. Both classes of
stimuli yield similar identification patterns, with the third categories
supplanting the old boundaries in the two-category tasks. From these results,
speech perception would appear to be no different from the perception of objects
and events in other modalities. Moreover, the magical number two would seem
irrelevant.

Figure 3



Two statements must be made before entertaining the notion that these
conclusions are legitimate. First, the two stimulus series in question were
judiciously selected. Few acoustic continua generated by a speech synthesizer
appear to have phoneme boundaries with near-zero slope in the second-formant
transition: /ba/-to-/da/ is closer to being an exception than the rule.
Deviations from the peculiar regularity in this syllable array, such as that
found in a /bi/-to-/di/ acoustic continuum (as in beam to deem), would be much
more difficult to model in a visual continuum. Second, we should consider the
nature of the middle categories in each set of responses. Intuitively they seem
quite different. The middle visual category would appear psychologically more
real than its neighbors. Indeed, the terms ascending and descending are derived
with reference to horizontal. The middle speech syllable category, on the other
hand, is a tenuous if not bogus domain. Certainly, /ba/ and /da/ are not derived
perceptually with reference to an ambiguous stimulus that is difficult if not
impossible to pronounce. In short, we see horizontal lines every day; we do not
"hear" ambiguous speech sounds. Just as with a Necker cube, the percept flips
one way or the other: it is either /ba/ or /da/, and rarely anything else,
unless one asks the subjects to perform the unusual task of "ambiguating" the
syllables as I have done.

Discriminating. Given these clues that a /ba/-to-/da/ acoustic continuum
is perceived somehow in a unique and quantal manner, we should look to a second
and more important manifestation of the magical number two: nonlinearities in
discriminability.

If a listener/viewer is asked to compare two members of one of the arrays
of stimuli used thus far, how accurate are her responses? For purposes of
uniformity, both arrays of stimuli are presented in a sequential discrimination
task: the first stimulus is presented, followed by a silent or blank interval
of one second, followed by the second stimulus (either identical to the first or
two steps removed along the physical continuum). In this manner, along with
item-pairs that are identical, Stimuli 1 and 3, 2 and 4, 3 and 5, 4 and 6, and
5 and 7 are compared. Subjects are asked to report whether the two items are
the same or different. Only the "different"-pair results are of interest here
and are shown in Figure 4; few errors occur on "same"-pair discriminations in
this type of task. Notice the sharp discrepancy between the two darker
functions. The speech-syllable data, shown in the top panel, demonstrate a sharp
peak in discriminability at the Stimulus-3/Stimulus-5 comparison that rapidly
tapers to lower-than-chance performance at either end of the continuum. The
slanted-line function, on the other hand, is at or near 100 percent performance
throughout the stimulus range.

Comparing these discrimination results with the two-category identification
functions superimposed on them, we see that for the speech items there is
a correspondence between the crossover of the complementary identification
curves and the peak in the discrimination function. Labelability changes
inversely with discriminability. Items can only be perceived as distinct from
one another when they have different names. This nonlinearity lies at the heart
of the interest in the magical number two; it is called categorical perception,
and it is in sharp contrast to the performance on the same task for the
slanted-line stimuli. Acoustic differences between speech Stimuli 1 and 3 or 5
and 7, for example, are typically inaccessible to the listener. However, the
magical number two is not as absolute as it seems here, with two discrete
categories and a perceptually distinct boundary between them. It is necessary to
consider certain systematic deviations from strict twoness.

Figure 4: Two-step discrimination functions for speech syllables and slanted
lines, superimposed on their respective identification functions.
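
The claim that items are distinguishable only when they receive different names can be turned into a rough prediction. The sketch below assumes a strictly label-based listener: each item in a pair is labeled independently, and the listener answers "different" only when the two labels differ. Both that model and the identification proportions are assumptions made for illustration; neither is taken from the data in Figures 3 and 4.

```python
# Hypothetical probability of labeling each of Stimuli 1..7 as /ba/
# (illustrative values only, NOT the paper's data).
p_ba = [1.00, 0.99, 0.97, 0.50, 0.03, 0.01, 0.00]

# Two-step comparisons used in the sequential discrimination task.
pairs = [(1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]


def p_different_labels(p_first, p_second):
    """Probability that two independently labeled items get different names
    (an assumed, strictly categorical model of discrimination)."""
    return p_first * (1 - p_second) + (1 - p_first) * p_second


predicted = {
    (a, b): p_different_labels(p_ba[a - 1], p_ba[b - 1]) for a, b in pairs
}

for pair, prob in predicted.items():
    print(pair, round(prob, 3))

peak_pair = max(predicted, key=predicted.get)
print("predicted peak at comparison", peak_pair)
```

Under these assumptions the predicted function peaks at the Stimulus-3/Stimulus-5 comparison, which straddles the identification crossover, and falls close to zero for within-category pairs.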



B. Plus or Minus Two Fudge Factors

Just as Miller (1956) has his small margin for error, a fudge factor of
plus or minus two, we speech researchers also have ours. It is considerably
smaller; it is difficult to scale down to size in terms of numerical deviations
from our magical integer; in fact, it may be difficult to determine whether it
is actually positive or negative. What is clear is that it manifests itself,
roughly, in two "sizes," one trifling and the other not-so-trifling.

The trifling deviation from the magical number two, and the strict
categorical perception it implies, can be seen using the same /ba/-to-/da/
continuum. When listeners are presented Stimuli 1 and 3, both of which are
identified as BAH nearly 100 percent of the time, and they are asked to judge
which of the two is more BAH-like (or, in Rosch's (1973) terms, which is the
prototype), they will typically determine that Stimulus 1 fits the bill. They
will do this, however, only after they have laughed at the experimenter for
asking them to perform such a ridiculous chore, only after she has reassured
them that there really is a difference between the items, and only after she has
cajoled and exhorted them to do the best they can. Even after these
machinations, their performance is rarely close to being perfect. This special
tutoring of within-phoneme-class acoustic differences does not appear to
transfer to other stimuli, and often is not useful as pretraining for other
tasks with the same stimuli. The fact that subjects can report differences
between two different tokens of /ba/ is more a testament to what the human
perceptual apparatus can do in an unusual situation than to what it does do in a
normal situation. In all fairness, however, it should be noted that this
deviation from the magical number two is more trifling in size than in theory.
Its discovery by Barclay (1972) was a blow to some stricter views of speech
perception.

The not-so-trifling deviation from the magical number two requires another
set of stimuli, and it is theoretically even more important than the first fudge
factor. Shown at the top of Figure 5 are the endpoints of an acoustic continuum
of vowels from /i/ as in heat to /ɪ/ as in hit. Between them, one can easily
generate five intermediate stimuli, thus creating a seven-item array with equal
increments of acoustic change between all members. Here, instead of changing
slopes of transitions, the frequencies of entire resonances are changed,
increasing in value from /i/ to /ɪ/ for the first formant and correspondingly
decreasing in value for the second and third formants. (The addition of the
third formant here increases intelligibility, but the array would yield nearly
identical results without it.) When these items are randomly ordered and
presented many times to listeners, results show quantal identification functions
similar to those shown at the top of Figure 3 for consonants and for slanted
lines. That is, Stimuli 1 through 3 are identified as /i/, Stimuli 5 through 7
are identified as /ɪ/, and only Stimulus 4 is ambiguous between the two.
Discrimination results, however, reveal a pattern unlike those for consonants or
for slanted lines. They are shown in the lower panel of Figure 5.

Notice that the vowel discriminations lie intermediate between the
previously discussed consonant and slanted-line functions. There is a "peak" in
the function at the Stimulus-3/Stimulus-5 comparison, but the "troughs," or
regions of poor discriminability, are not nearly as "deep" (close to zero
percent performance) as those for stop consonants. The reason for this appears
to be that within-category differences between two tokens of the same vowel
remain in short-term memory long enough for accurate comparisons to be made.
Performance on these comparisons, however, is still worse than for those made at
the phoneme boundary. The reason for the trough-peak difference is related to
the way in which the information is encoded. For speech stimuli the height of
the peaks in any discrimination function can be taken as a measure of the
strength of phonetically coded information, and the depth of the troughs in
relation to those peaks can be interpreted as the relative strength of
acoustically coded information. Only the phonetic code is relevant to short-term
memory as it is usually defined; the acoustic code fades much more rapidly.
Differences between the consonant and vowel functions at within-phoneme-category
comparisons are one assessment of the magnitude of this not-so-trifling
deviation from the number two. Some raw acoustic information about vowels is
available for comparison purposes; practically none is available about the
consonants.

Figure 5: Schematic spectrograms of /i/ and /ɪ/ (as in eat and it),
endpoints of a seven-item acoustic array, and the discrimination
function for that array compared against the slanted lines and stop
consonants.

One feature of these discrimination results that I have not discussed so
far is the effect of the duration of the silent interval between the two stimuli
in the sequential discrimination task. For the vowel stimuli, this interval is
vital for determining the depths of the troughs in the functions. If the
interval is shortened from one second to a quarter of a second, listener
performance on within-category comparisons of Stimuli 1 and 3 or of Stimuli 5
and 7 will increase to as much as 85 or 90 percent. If, on the other hand, this
interval is lengthened to as much as two or three seconds, listener performance
on these same comparisons will decrease to as low as 40 to 50 percent. There is
no such effect of silent or blank interval on consonants, and probably none for
slanted lines (although I have not done the experiment). For consonants, in
particular, the within-category acoustic information appears to be lost prior to
the onset of the second stimulus in a to-be-discriminated pair, regardless of
how short that interval may be. Perception of consonants, then, is almost
instantaneously phonetic. Practically no raw acoustic husk remains in memory. In
Rosch's (1973) terms, our perceptual apparatus warps all stop-consonant stimuli
that fall within a phoneme category, changing them into a perceptual prototype.
Once that prototype is internally registered, the nonprototypic vagaries of the
stimulus are largely inaccessible to consciousness except through experimental
exhortations like those discussed earlier. A more careful account of these
phenomena and the extent to which they occur has been developed by Pisoni (1973,
[year illegible]; Pisoni and Lazarus, 1974; Pisoni and Tash, 1974), and the
interested reader should refer to those articles.

There are many psychological differences between stop consonants and
vowels. I feel that it is incorrect, however, to say one class of phonemes is
more "speechlike" than the other. Nevertheless, since the perception of stop
consonants adheres more to the magical number two, it is those stimuli that we
will consider in depth in this chapter. Before continuing with the stops,
however, we must consider the possibility of categories and boundaries elsewhere
in audition.

C. The Magical Number Two in Nonspeech Sounds

For years, those of us interested in speech and the number two (implied by
the phenomenon of categorical perception) have called those auditory events that
are not speech "nonspeech." Underlying the use of this handy and
linguo-centrically biased term is the belief that speech is somehow different
from all other auditory events, just as it is different from the perception of
slanted lines. Most of us, I think, still believe this to some degree. I do.
But, whereas we used to be armed with an arsenal of empirical data supporting
the uniqueness of speech processing, lately we have begun to strip ourselves of
these findings. The most important weapon in our arsenal and seemingly the most
invulnerable to attack was categorical perception.

Shortly after the initial formulation of categorical perception (that it
involves a discontinuity in discrimination functions for stimuli equally spaced
along a physical continuum) arose the issue of "acquired distinctiveness."
Expressed as a question in its simplest form: Are the discrimination peaks
learned for stop-consonant stimuli? Do children, for example, acquire the
distinct speech categories, or are they innate? A dozen years after the question
first arose it was answered conclusively, and that answer is discussed in
Section D. At the time, however, it was not possible to test young infants, so
the question was asked in another form: Can nonspeech discrimination peaks be
acquired through training? Perhaps the process of acquiring categories and
boundaries follows a developmental trend: initially, all stimulus pairs may be
equally discriminable to the untrained listener, and her discrimination function
would be "flat" and moderately above chance level throughout the range of the
continuum; only later, with training, would a peak appear in this function, and
perhaps the troughs would correspondingly drop within each category.

Harlan Lane, an early proponent of this view, found that certain subjects
listening to certain complex nonspeech sounds (spectrographically inverted
speech patterns) acquired discrimination peaks through a simple training
procedure (Lane, 1965). Pisoni (1971), in a careful replication with similar
stimuli, found this to be true for a few selected subjects, but generally not
true. Moreover, Studdert-Kennedy, Liberman, Harris, and Cooper (1970) found even
Lane's selected data unconvincing: whereas there were peaks in his functions,
Lane found few deep troughs and less correspondence between discrimination and
identification functions than would be desired. As seen when comparing the
perception of vowels and consonants in Figure 5, troughs are vital. If the
distinctive nature of the peaks can be acquired through training, and it is not
entirely clear that they can be, the troughs do not appear to be learned: there
seems to be no pneumatic trade-off between the acquisition of peaks in training
and the loss of ability to discriminate within a category.

But the Lane and the Pisoni stimuli were not "natural" nonspeech sounds;
nor are sine wave tones and other more familiar psychoacoustic stimuli "natural"
in any real sense. Are there commonly occurring stimuli in our environment that
are perceived categorically and that obey the laws of the magical number two?
An obvious candidate here is musical sounds. They are natural at least to the
extent that they rely on simple mechanical action of easily fashioned materials.
Locke and Kellar (1973) varied the middle component of triadic chords in search
of categorical perception, and found some categorical tendencies in musically
trained listeners, but few in musically naive listeners. Their results were
promising, but with two important drawbacks. First, the discrimination
functions for the musically trained listeners were more similar to the vowel
function shown in Figure 5 than the typical stop-consonant function beneath it.
Second, and more damning, is the fact that extensive musical training seemed to
be a requisite for even these vowellike functions. Again, we are back to
"acquired distinctiveness," and to the lack of sufficient troughs in the
discrimination functions.

More recently, categorical perception has been found in a musically
relevant dimension, and the results meet all the requirements for binary
processing according to the laws of the magical number two (Cutting and Rosner,
1974; Cutting, Rosner, and Foard, 1975). The dimension is that of attack, or
rise time. Rapidly rising sawtooth waves, for example, sound like the plucking
of a stringed instrument, such as a guitar; more slowly rising sawtooth items
sound like the bowing of a similar instrument, such as a violin. Oscillograms of
token "pluck" and "bow" sounds are shown in Figure 6. Rise time cannot be
systematically varied when playing actual musical instruments, but it can be
varied readily on a Moog synthesizer. We chose to vary the rise time from 0 to
80 msec in 10-msec increments for several continua of musical sounds. Note that
such variation is minor in magnitude compared to the long and tapering offset of
the stimuli.
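
To make the manipulation concrete, a rise-time continuum of this kind can be sketched in code. The sketch assumes 10-msec steps from 0 to 80 msec, which is what the 10-, 30-, 50-, and 70-msec items discussed in the text imply. Everything else here is hypothetical: the sample rate, the 440-Hz fundamental, the one-second duration, and the linear onset and offset ramps are illustrative choices, not the synthesizer settings actually used.

```python
# Assumed parameters (hypothetical; not the settings of the original stimuli).
SAMPLE_RATE = 10_000   # samples per second
FREQ_HZ = 440.0        # fundamental frequency of the sawtooth
DURATION_MS = 1000     # total duration, including the long tapering offset

# Rise times from 0 to 80 msec in 10-msec steps: a nine-item continuum.
RISE_TIMES_MS = list(range(0, 81, 10))


def sawtooth(t_sec, freq_hz):
    """Sawtooth wave in [-1, 1) evaluated at time t_sec."""
    phase = (t_sec * freq_hz) % 1.0
    return 2.0 * phase - 1.0


def envelope(t_ms, rise_ms, duration_ms):
    """Linear rise to full amplitude over rise_ms, then a linear taper
    back to zero at duration_ms (both ramps are assumptions)."""
    if rise_ms > 0 and t_ms < rise_ms:
        return t_ms / rise_ms
    return max(0.0, (duration_ms - t_ms) / (duration_ms - rise_ms))


def make_stimulus(rise_ms):
    """Return one stimulus as a list of amplitude samples."""
    n_samples = SAMPLE_RATE * DURATION_MS // 1000
    samples = []
    for i in range(n_samples):
        t_sec = i / SAMPLE_RATE
        gain = envelope(t_sec * 1000.0, rise_ms, DURATION_MS)
        samples.append(gain * sawtooth(t_sec, FREQ_HZ))
    return samples


continuum = {rise: make_stimulus(rise) for rise in RISE_TIMES_MS}
print(len(continuum), "stimuli,", len(continuum[0]), "samples each")
```

The items differ only in their first 80 msec or less; the remainder of each waveform follows the same long decay, which is why the physical variation is small relative to the whole stimulus.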

When these stimuli are placed in the same paradigms mentioned previously
for the speech syllables /ba/ and /da/, they yield remarkably similar results.
Sawtooth wave items with less than 40-msec rise time are identified as pluck
nearly 100 percent of the time, stimuli with rise times greater than 40 msec are
identified as bow nearly 100 percent of the time, and only the 40-msec rise time
stimulus is ambiguous (identified as pluck about 40 percent of the time and as
bow 60 percent of the time). When these stimuli are placed in the sequential
discrimination task, items within a category sound identical to the listener:
asked to judge whether pairs of stimuli are the same or different, listeners
report that items of 10- and 30-msec rise time and of 50- and 70-msec rise time
are the same item more than 75 percent of the time, well below chance. Only
when items with 30- and 50-msec rise times are compared do listeners perform
well, and here they make only about 15 percent errors. These results clearly
indicate that there are quantal identification functions for pluck and bow
sounds, that there is a peak in the discrimination function lying astride the
crossover point of the complementary identification functions, and that the
discrimination function falls off into deep troughs at either side of the peak.
Thus, the magical number two reigns in nonlinguistic domains as well as in
speech.



Six other aspects of our data are important concerning categorical
perception of these musical sounds. First, the time interval between the items
in a discrimination pair matters not at all: performance on within-category
pairs, for example, is no better when the two stimuli are separated by 250 msec
than when they are separated by nearly two seconds. These results strongly
suggest that the musical sounds adhere to the magical number two more strictly
than do vowels and that the "fudge factor" is likely to be of the trifling
rather than the not-so-trifling size.

Second, the results do not appear to be mediated directly by the labels
pluck and bow. In our initial study we were careful to administer the test in
two conditions. In one we carefully tutored subjects in the use of the terms
pluck and bow, playing extreme tokens from the continuum several times before
they participated in the identification test. The identification test gave them
an additional 15 minutes of practice using the terms before they listened to
their first discrimination pair. In a second condition the subjects started
right away with the discrimination test, did not hear practice items, and were
not told of the labels pluck and bow. The results for the two groups were
essentially identical and suggest that the labels pluck and bow have little to
do with the perceptual process.



Third, one might think that our results are robust because the stimuli
emulate common musical sounds or because the stimuli have complex spectra.
Would such results occur for simpler auditory events, varied in the same manner,
that do not sound like convincing tokens of musical sounds? We varied rise
times of sine wave stimuli and found the same pattern of identification and
discrimination results as for the sawtooth items. These results indicate that
the perceptual process involved here is more fundamental than just a
music-processing system might be. Sine waves with rapid rise times sound vaguely
like a flute played staccato style; but sine waves with less rapid rise times
are not at all convincing as notes from a flute played in a more legato style.

Fourth, one may be suspicious of our stimuli, since we played them over
loudspeakers and earphones. Because certain members of the musical arrays have
very rapid rise times, they are likely to induce clicks into the transduced
signal; that is, the response characteristics of the broadcasting devices may be
sluggish enough to produce audible short bursts at the beginning of the items.
One fear is that the presence or absence of such clicks might correlate
perfectly with the perceptual categories: pluck equals presence of a click, bow
equals no click. To ensure that this artifact did not account for our results,
we inspected our stimuli after they were played through a loudspeaker,
redisplaying them with high resolution on a computer-controlled oscilloscope. We
found such small perturbations in the signals only for the 0- and 10-msec rise
time stimuli. However, items with 20- and 30-msec rise time, which were members
of the same perceptual category, did not have these irregularities.

Fifth, we found that the long tails on the pluck and bow stimuli are
necessary for the successful identification and discrimination of the items.
When the items are trimmed to 250 msec in duration by simply lopping off the
last 750 to 850 msec of the items, identifiability and discriminability are
markedly impaired. At first, this seemed rather bizarre to us: the final
three-quarters of the stimulus does not appear to carry any information about
stimulus onset and would seem unnecessary for maintaining musical integrity of
the sounds. This view is clearly incorrect. Moreover, we should not have been
surprised at these results. When speech items such as /ba/ and /da/ are severely
trimmed from 300 msec to 40 or 50 msec, removing only the steady-state vowel,
they often cease to sound like speech stimuli and are unlabelable by many
listeners. The integrity of the syllables is thus violated, and in a manner
similar to the truncation of our pluck and bow items.

Sixth, and perhaps most interesting of all, is the fact that rise time is
not only a cue for the distinction of musical items, but it is also used as a
cue in speech: CHA (/tʃa/), as in chop, has a very rapid rise time in its
fricated (or noiselike) portion, whereas SHA (/ʃa/), as in shop, has a much more
gradual onset. Tokens of these speech syllables are shown in Figure 6 next to
the pluck and bow items. When an array of these syllables is generated on a
computer-driven speech synthesizer, and when they are inserted into the same
paradigms as we have discussed thus far, listeners yield patterns of
identification and discrimination results that are nearly identical to those for
the pluck and bow items. We find it compelling that a single cue, rise time, is
used to distinguish categories inside and outside of speech. While speech
production and speech perception are unique to man, we should not expect all
speech processing mechanisms to be unique as well. In evolutionary terms, it
would have made sense to build a speech processing system on underlying and
already existing auditory faculties. It seems reasonable that at least some of
the binary distinctions on which speech is built would be based on binary
auditory distinctions. We suggest rise time is one of them (Cutting and Rosner,
1974).

To account for discontinuities in the discrimination functions of stimulus
arrays /ba/-to-/da/ and /i/-to-/ɪ/, many speech researchers have thought in
terms of phonetic and auditory memories. The peaks in these functions have been
taken as a measure of the strength of phonetic memory, similar to the more
commonly known short-term memory, and the troughs, in relation to the peaks, are
taken as a measure of an auditory memory similar to what is often called echoic
memory. The categorical perceptions of pluck and bow stimuli jar this view
somewhat. The notion of an auditory memory accounting for the troughs remains
unchallenged. However, the notion that a phonetic memory underlies the peaks in
a discrimination function must be cast aside. The peak in the pluck and bow
discrimination function can in no way be thought of as phonetic. Instead, this
higher-level memory may be reserved for highly coded decisions about auditory
signals; pluck versus bow or /ba/ versus /da/ would both qualify here. It is
relatively easy to understand why speech sounds are categorical and coded into a
phonetic string rather than left as raw acoustic information. The memory storage
capacity required for one second of high-quality speech (such as reproduced
on a tape recorder) would be 40,000 bits of information, whereas the storage
capacity required for one second of phonetically coded speech would be only
about 40 bits of information, plus the necessary subroutines to decode that
string (Liberman, Mattingly, and Turvey, 1972). Clearly, in terms of a
thousand-to-one savings in storage capacity, it makes sense to code speech
phonetically and at a rather rapid rate so that it can be comprehended. The rub,
however, is to understand why the same system appears to code and categorize
musiclike sounds when such a savings may not be needed. The answer is
necessarily indirect, and must first take us back to the notion of "natural"
categories and their apparent function.
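
The storage comparison above is simple arithmetic, and it can be checked in a
few lines. The sketch below uses only the two figures quoted from Liberman,
Mattingly, and Turvey (1972); the function name and constants are illustrative,
and the cost of the decoding subroutines is ignored.

```python
# Figures quoted in the text (Liberman, Mattingly, and Turvey, 1972).
RAW_BITS_PER_SECOND = 40_000      # one second of high-quality recorded speech
PHONETIC_BITS_PER_SECOND = 40     # one second of phonetically coded speech


def storage_ratio(raw_bits: float, coded_bits: float) -> float:
    """Return how many times more storage the raw signal needs."""
    return raw_bits / coded_bits


ratio = storage_ratio(RAW_BITS_PER_SECOND, PHONETIC_BITS_PER_SECOND)
print(f"raw : coded = {ratio:.0f} : 1")
```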

D. Naturalness of the Magical Number Two in Speech and Music 

Thus far I have presented what should be compelling evidence of discrete
categories in speech and music, but I have said nothing of their "naturalness."
Rosch concluded that certain color and shape categories are natural because they
appear to remain largely unmodified by the presence or absence of language terms
for them. To find exactly parallel results for speech items is difficult.
Speech syllables have the unique property of providing their own nonarbitrary
labels: /ba/ is BAH and /da/ is DAH, and they are pronounced and labeled as
such by (almost) all peoples of the world. We are, therefore, forced to use a
different technique for assessing naturalness of speech categories and
boundaries, and we must use the same method for those in music.

The Oxford English Dictionary defines "natural" as: "present by nature;
innate; not acquired or assumed." If speech categories are natural by this
definition, all humans should be born with the ability to use them. One approach
for determining whether or not the perception of these categories is innate is
to test young infants. For reasons of practicality, the infants tested have
been from one to four months old. The assumption here is that these infants
will have had little if any opportunity to learn much about their to-be-native
language and that any results they yield are characteristic of those
capabilities that are genetically "wired-in."

Speech categories. It should be clear that one cannot ask infants to identify
/ba/ and /da/. Young children typically cannot produce such differences in
a systematic and controlled fashion until they are many months older. Therefore,
it is out of the question to try to obtain identification functions. One can
determine, however, whether or not infants can discriminate speech sounds and
discriminate them in a manner approximating that for college-aged subjects.
This is exactly the approach of Peter Eimas and his colleagues (see, for example,
Eimas, Siqueland, Jusczyk, and Vigorito, 1971; Eimas, 1974; Cutting and Eimas,
1975) in a series of pioneering studies.

It is one thing to ask an infant to discriminate two speech sounds, but it
is another to pose that question in a manner for which he can give a suitable
and measurable response. Eimas has used a conditioned nonnutritive sucking
procedure; others have used heart rate (Morse, 1972). In the Eimas and Siqueland
procedure the infant is given a hand-held nipple on which to suck. Instead of
transducing nutrients, it transduces pressure to a pressure-sensitive apparatus,
which in turn triggers, for example, the speech sound /ba/. The more frequent
the high suction responses of the infants, the louder (or, in another procedure,
the more frequent) the speech sound is presented against background noise. The
infant quickly learns this association and is quite willing to make several
hundred sucking responses over the course of about ten minutes merely to hear
the same sound repeated. The time-frequency course of the infant's responses is
of particular interest. Over the course of about three minutes after the
initial learning of the association, the infant increases his responses to a peak
of as much as 50 or 60 per minute, well above a preassociation nonnutritive
baseline. Shortly thereafter the infant seems to tire of the situation and
responses taper off rather dramatically in the following two minutes.

After a drop in responses of at least 30 percent, one of three things happens
to the infant. In one control condition the infant continues to hear the
same stimulus, Stimulus 3 in Figure 2, say, over and over again. Responses here
continue to approach asymptote at or below the baseline rate. In the
experimental condition, however, the stimulus is shifted to /da/ (Stimulus 5) and
the infant's responses begin to increase again, only beginning to fall the third
or fourth minute after the stimulus change. Most important is the second control
condition. Here the stimulus shifts from Stimulus 3 to Stimulus 1, but both are
identified as /ba/ by adults. In other words, this change is physically just as
great as the across-boundary shift, but both stimuli lie within the same
category. As in the first condition, the infant's responses continue to approach
asymptote. All these trends are shown in Figure 7.
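
The habituation criterion that triggers the stimulus change can be stated as a
small rule. The sketch below is only an illustration: it assumes that the
30 percent drop is measured against the highest per-minute rate reached so far,
which the text does not spell out, and the response counts are invented.

```python
def shift_minute(rates, criterion=0.30):
    """Return the index of the first minute whose response rate has fallen
    at least `criterion` (a proportion) below the running peak, or None
    if the criterion is never met.

    Assumption: the drop is measured against the running peak rate.
    """
    peak = 0.0
    for minute, rate in enumerate(rates):
        peak = max(peak, rate)
        if peak > 0 and rate <= peak * (1.0 - criterion):
            return minute
    return None


# Hypothetical sucks-per-minute record, invented for illustration only.
example_rates = [22, 35, 48, 56, 52, 44, 38]
print(shift_minute(example_rates))
```

In this made-up record the peak is 56 responses per minute, so the criterion is
met at the first minute with 39 or fewer responses.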

Three aspects of these infant data are interesting when compared with the
adult data discussed in previous sections. First, the across-category stimulus
shift in the experimental condition here corresponds to the peak in the
discrimination functions seen in the top panel of Figure 4. In the infant's case,
the dishabituation of the sucking response is taken as evidence that he perceives
that a new stimulus has been presented, one that deserves more attention and
subsequently more sucking responses. Hence, like adults, infants as young as
one month can perceive phonetically relevant features. Second, the
within-category stimulus shift in the crucial, second control condition
corresponds to the troughs in the adult discrimination function. Continued
habituation of the sucking response is taken as evidence that the infant did not
perceive that a "new" stimulus had been presented. Just as for the adult, the
infant may have merely regarded the second stimulus as identical to the preshift
stimulus. Thus, like adults, infants as young as one month cannot perceive
phonetically irrelevant changes in acoustic features even when they are identical
in magnitude to the across-category, phonetically relevant change. Third, and
more speculative than the previous two points, is that the difference between the
functions for the no-shift and within-category-shift conditions suggests that
even for very young infants there is a trifling-sized fudge factor that modifies
the magical number two! Although the difference between these two groups has
never been significant in a single study, the trend is unmistakable: the
infants in the within-category-shift condition attenuate their habituation rate
slightly more than the no-shift group.

Do infants perceive categorically the same speech continua that adults
perceive? It appears that they probably do, and maybe even more so. For example,
they yield results functionally identical to those in Figure 7 when
discriminating a voice-onset-time continuum from /ba/ to /pa/, and when
discriminating /ra/ from /la/ cued only by changes in the third-formant
transition. This second result is important since the difference is one that
native Japanese-speaking adults cannot perceive and do not have in their
language. Such a phenomenon presents us with the tantalizing notion that infants
may be superior to adults in perceiving certain speech-relevant dimensions. It
suggests that, while it is true that the distinctiveness of speech categories is
not acquired by a learning process, it may also be true that certain potential
distinctions are lost when unused by the developing child. One might consider
this process "acquired indistinctiveness." It might even appear to support some
of Lane's (1965) original contentions. It does not, however, since it is not the
troughs of the discrimination functions that get deeper (indistinctive) but
rather the peaks themselves that disappear.

Music categories. Are pluck and bow sounds discriminated categorically by
infants? They are, as we have recently discovered (Jusczyk, Rosner, Cutting,
Foard, and Smith, 1975). The stimuli used in this study were selected from
those used in the adult studies (0-, 30-, and 60-msec rise time items) plus an
additional stimulus from the original set (with 90-msec rise time). Each of 18
two-month-old infants was run in two conditions: all participated in a
condition involving the cross-boundary, 30- and 60-msec items, and six infants
were assigned to each of three control conditions. One group performed the
no-shift control; that is, each infant continued to listen to the same stimulus
throughout the experimental session. A second group listened to the 0- and
30-msec rise time items, and the third group listened to the 60- and 90-msec rise
time items. Counterbalancing of pre- and postshift stimuli was observed, as well
as counterbalancing the order of the experimental (30 to 60 msec) and control
conditions (no shift, 0 to 30 msec, or 60 to 90 msec).

Results were compelling. Seventeen of the 18 infants demonstrated a higher
sucking response rate in the cross-category-shift condition (30 to 60 msec) than
in the control condition, regardless of which of the three groups they belonged
to. The general pattern of responses plotted over time was functionally
identical to that shown in Figure 7: habituation of the response continued in the
no-shift condition and in both within-category-shift conditions, while
dishabituation occurred for the across-category-shift condition.
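
To give a sense of how lopsided 17 of 18 is, one can ask how often such a split
would arise if each infant were equally likely to respond more in either
condition. The sign test below is an illustration added here, not an analysis
reported in the study; the function name is hypothetical.

```python
from math import comb


def sign_test_upper_tail(successes: int, n: int) -> float:
    """One-sided probability of `successes` or more out of `n` when each
    outcome is equally likely to go either way (binomial, p = 0.5)."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n


p = sign_test_upper_tail(17, 18)
print(f"P(17 or more of 18 by chance) = {p:.6f}")
```

Under that chance assumption the probability is 19 in 262,144, or roughly 7 in
100,000.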

The adult boundary for these sawtooth wave pluck and bow stimuli is at
about 35- or 40-msec rise time; it is clear that the infant boundary is near
this same mark. It should also be clear that both speech categories and
"nonspeech" (musiclike) categories and boundaries are innate to humans: one can
think of little evidence supporting the notion that these young infants could
have acquired the distinctive categories demonstrated here. Thus, these
categories and boundaries appear to be "natural" according to the strictest
possible psychological interpretation of that word.

An interesting question now arises. These categories and boundaries seem
innate to humans, but are they innate to other animals (e.g., primates) as well?
In other words, is there any evolutionary continuity in the development of the
mechanisms behind these effects? Morse and Snowdon (1975), for example, have
tested rhesus monkeys in the discrimination of speech syllables such as those
discussed here. Their results do not support a strong form of categorical
perception in infrahuman primates, but there are some categorical tendencies.
Would rhesus monkeys discriminate pluck and bow sounds in a manner as striking
as adult and infant humans? Alas, we don't know yet. Results here will be
interesting regardless of the outcome. If rhesus monkeys (or higher nonhuman
primates) do discriminate these musiclike sounds, we will have evidence that the
perception of rise time, which is used to cue certain speech categories, has an
evolutionary history older than that of speech. This would support the view
that certain categories in speech were built on preexisting nonlinguistic
categories (Cutting and Rosner, 1974). If the nonhuman primates do not
discriminate pluck and bow sounds, we will have evidence that the perception of
rise time, and the categories and boundaries in speech and music that it cues,
evolved relatively late. Moreover, waxing slightly toward the philosophical,
music perception (and subsequent music appreciability) has a recent evolution of
its own.

In the first two-thirds of this paper I have presented considerable evidence
to demonstrate the existence, and the "naturalness," of certain perceptual
categories and boundaries in speech and music. I have not attempted to be
overwhelming in breadth: few stimulus dimensions in speech were discussed, and
only one in music, partly because of the lack of knowledge in this relatively new
field. In the rest of the paper I will consider the nature of the mechanisms
behind these manifestations of the magical number two.

E. Mechanisms behind the Magical Number Two

Very recently a new paradigm has emerged in the field of speech perception.
The technique is known as selective adaptation. Its roots are in vision research
and in the mapping of the architecture of brain-cell function during perception.
However, it is logically similar in many ways to the much older and better-known
phenomena associated with visual afterimages. We will use those photochemical
data as a basic framework for discussing the perceptual effect in speech.

It is commonly known that if a person stares at a patch of blue color for
about 15 to 30 seconds, and then stares at an illuminated white wall, she will
see a patch of yellow color with the same contour as the original blue patch.
This effect is known as a chromatic afterimage. It is best explainable in terms
of opponent-process mechanisms, as first championed by Hering. Blue is thus
viewed as the opposite color to yellow, at least at some stage of photochromatic
analysis subsequent to excitation of the cones on the retina. Blue and yellow
appear to synapse, if you will, onto the same cell bodies, but one color in an
excitatory fashion and the other in an inhibitory fashion. Staring at a blue
patch for many seconds fatigues the visual system such that when given a neutral
stimulus (white), the viewer will perceive for a brief period of time that
stimulus as being of the opposite color (in this case yellow). In a reciprocal
fashion, staring at a yellow patch will yield a sensation of blue when the
viewer is presented with a postadaptation neutral field.

One cannot stare at a speech syllable. Thus, the experimental adaptation
situation must be changed to a degree for such auditory sounds. Since the
auditory signal fades rapidly [which Hockett (1960), for one, views as a
blessing], it must be presented over and over, perhaps as many as 100 or 200
times, to continually refresh the perceptual "image." If this adapting stimulus
is /da/, for example, and one is presented with a neutral stimulus near the
/ba/-/da/ boundary (Stimulus 4 shown in Figure 2), the listener will perceive
that neutral stimulus as being a good exemplar of /ba/. One may view the stimuli
/ba/ and /da/ as being "opposites" of one another along the dimension of change
in the second-formant transition. Hence, just as yellow is the opposite of blue,
/ba/ is the opposite of /da/, and after adaptation, a neutral stimulus between
the two prototypes will be perceived as being a member of the opposite class.
Moreover, after adaptation the physical domain of the category of the adapting
stimulus shrinks, while the physical domain of the unadapted stimulus category
expands to fill the void.

The adaptation paradigm as used in speech perception is an exhaustive one.
Typically, the listener is presented with more than 100 tokens of one of the
endpoint stimuli in a speech continuum (in this case the Stimulus 1, /ba/, or the
Stimulus 7, /da/) and then given a brief identification test of the array of
stimuli. This postadaptation test may consist only of one token of each of the
seven stimuli in the array presented in random order. The subject identifies
each of these items as /ba/ or /da/. Then another long series of adaptation
presentations begins using the same stimulus as before. After this sequence, a
second seven-item series of stimuli is presented for the listener to identify.
This cadence may be repeated a dozen times or more before the experimental
session is completed.

Typical results are shown in the top panel of Figure 8. Given an adapting
item such as the Stimulus 7 /da/, the number of /ba/ responses to all members of
the array tends to increase. Plotting only the number of /ba/ responses for
each stimulus in the array (unlike the complementary plots in Figures 3 and 4),
one sees that the crossover point, or 50 percent response level, has shifted
toward the /da/ end of the continuum. In particular, Stimulus 4, which is
normally identified as /ba/ on only about 40 percent of all trials in a
preadaptation identification sequence, is now identified as /ba/ on better than
95 percent of all postadaptation identification trials.
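
The crossover point can be read off an identification function by linear
interpolation between the two stimuli that straddle the 50 percent level. The
sketch below is illustrative only: apart from the 40 and 95 percent values for
Stimulus 4 quoted above, every proportion is invented, and the function name is
hypothetical.

```python
def crossover(ident):
    """Return the (1-based, fractional) stimulus position at which the
    proportion of /ba/ responses falls through 0.5, or None if it never does.

    `ident` lists P(/ba/) for Stimulus 1, 2, ... in order.
    """
    for i in range(len(ident) - 1):
        hi, lo = ident[i], ident[i + 1]
        if hi >= 0.5 > lo:
            # Linear interpolation between stimulus i+1 and stimulus i+2.
            return (i + 1) + (hi - 0.5) / (hi - lo)
    return None


# Invented proportions of /ba/ responses for Stimuli 1 through 7.
pre_adaptation = [1.0, 1.0, 0.90, 0.40, 0.05, 0.0, 0.0]
post_adaptation = [1.0, 1.0, 1.0, 0.95, 0.45, 0.05, 0.0]

print(crossover(pre_adaptation), crossover(post_adaptation))
```

With these made-up values the boundary moves from just below Stimulus 4 to just
below Stimulus 5, that is, toward the /da/ end of the continuum.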

At least two important aspects distinguish the adaptation effect with
speech stimuli from the chromatic afterimage. First, it lasts a much longer
time: the chromatic afterimage may last for only about 30 seconds or so,
whereas the shift in the /ba/-/da/ identification function may last up to a few
hours or even longer. This result is directly linked to the second difference
between the phenomena. The chromatic afterimage does not transfer from one eye
to another. That is, if one looks at a patch of blue with only the left eye open
and then stares at a blank wall with only the right eye open, there will be no
afterimage. This simple demonstration indicates that the locus of the chromatic
effect is peripheral, or very near the retina and certainly before the neural
pathways of the two eyes first converge in the lateral geniculate body. The
adaptation effect with speech stimuli does transfer from one ear to the other
and generally maintains its magnitude. This result indicates that the locus of
the effect is central, and occurs after the pathways of the two ears converge
(as low in the system as the superior olivary complex or as high as the cortex).
These two factors, the duration of the effect and its locus, make it more
similar to the visual work done in the 1960s by McCollough (1965) and by Blakemore
and Campbell (1969), than to work with chromatic afterimages done originally in
the nineteenth century. The adaptation work done in the field of speech
perception, like the infant work discussed in Section D, was pioneered by Eimas
(Eimas, Cooper, and Corbit, 1973; Eimas and Corbit, 1973) and has been reviewed
recently by Cooper (1975).

Several other aspects of speech adaptation are important and are closely
related. First, the speech results have been interpreted in terms of feature
adaptation. Features are thought to be processed by perceptual decision
mechanisms. Adaptation shifts here contrast with the similar response shifts
resulting from changes in adaptation level (Helson, 1964), since the latter may be
accountable in terms of cognitive decision mechanisms. Second, these features
are binary: that is, they are neurological correlates of the magical number
two. Third, these features are often thought of as phonetic in nature, that is,
as linguistic rather than as auditory. By extension, they have been thought
unique to language. These three points deserve elaboration.

Feature analyzers as perceptual mechanisms. A major thrust of the first
speech adaptation study (Eimas and Corbit, 1973) was that the apparent shifts in
the identification functions were not attributable to response bias or other
"conscious" shifts in decision criteria. As proof of this position one must
find that not only do identification functions shift, but that corresponding
discrimination functions also shift and by the same extent. Indeed, Eimas and
Corbit (1973; and later Cooper, 1974) found that the discrimination functions do
shift and by the anticipated amount.

Although their data did not deal with /ba/ and /da/, I may legitimately
generalize as follows. In a preadaptation identification condition, Stimuli 1
through 3 are identified as /ba/ and Stimuli 5 through 7 as /da/. After
adaptation with the Stimulus 7 /da/, Stimuli 1 through 4 are now identified as
/ba/ and only Stimuli 6 and 7 as /da/. In other words, whereas Stimulus 4 is the
most ambiguous item in the preadaptation condition, Stimulus 5 is most ambiguous
in the postadaptation condition. If this change in identifiability is perceptual
in nature, a preadaptation discrimination peak should be at the
Stimulus-3/Stimulus-5 comparison, just as we have seen in Figure 4, whereas the
postadaptation peak should be at the Stimulus-4/Stimulus-6 comparison. This is
exactly the type of result found by Eimas and Corbit (1973), and is shown in the
lower panel of Figure 8. These are schematic, not actual, data since these
authors used different stimuli, but they accurately reflect their results. The
categories of the magical number two change their locus with regard to the
physical continuum, but they do not appear to change in any other manner.
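
The link between a shifted identification function and a shifted discrimination
peak can be sketched numerically. The example below assumes the familiar
prediction that, if listeners can discriminate only what they label
differently, the proportion correct in a two-category ABX task is
0.5 * (1 + (pA - pB)^2), where pA and pB are the proportions of /ba/ responses
to the two stimuli. That formula is an assumption brought in for illustration
and is not stated in this section; the identification values are schematic.

```python
def predicted_discrimination(p_a: float, p_b: float) -> float:
    """Predicted proportion correct from labeling probabilities alone
    (assumed two-category ABX prediction)."""
    return 0.5 * (1.0 + (p_a - p_b) ** 2)


def peak_comparison(ident, step=2):
    """Return the (1-based) stimulus pair, `step` apart, with the highest
    predicted discrimination score."""
    best_pair, best_score = None, -1.0
    for i in range(len(ident) - step):
        score = predicted_discrimination(ident[i], ident[i + step])
        if score > best_score:
            best_pair, best_score = (i + 1, i + 1 + step), score
    return best_pair


# Schematic P(/ba/) for Stimuli 1 through 7, following the description above.
pre_adaptation = [1.0, 1.0, 1.0, 0.40, 0.0, 0.0, 0.0]
post_adaptation = [1.0, 1.0, 1.0, 0.95, 0.5, 0.0, 0.0]

print(peak_comparison(pre_adaptation), peak_comparison(post_adaptation))
```

Under these assumptions the predicted peak moves from the
Stimulus-3/Stimulus-5 comparison to the Stimulus-4/Stimulus-6 comparison, one
step toward /da/, in line with the identification shift.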

It should be noted that postadaptation discrimination data are extremely
difficult to gather. Only one or two discrimination trials are given after each
long sequence of adapting stimuli. Thus, the task is very time consuming and
only the most dedicated subjects will listen to the many hours of nonsense
syllables, tediously and incessantly presented over and over again.

Feature analyzers as neural mechanisms. Underlying the shifts in
identification and discrimination functions are some allegedly quite simple
mechanisms. A scheme of how they might work is shown in Figure 9. Imagine two
detectors in the perceptual system, one whose primary job it is to respond to the
phoneme /b/ and the other to respond to /d/. Each of these is maximally sensitive
to a prototypic stimulus (perhaps Stimulus 1 for the /b/ detector and Stimulus 7
for the /d/ detector). In addition, each will respond to other neighboring
stimuli as well, but at a somewhat reduced rate. The /b/ and /d/ detectors are
relatively "close" to one another in that they can respond to the same stimulus,
provided that it is roughly midway between the two stimulus prototypes along the
physical continuum most relevant to the phonetic distinction, in this case the
second-formant transition. Normally, the boundary between /b/ and /d/ is at the
crossover point of the sensitivity curves, as shown in the top panel. The figure
is drawn to be reminiscent of hypothetical signal detection functions and of a
simplified one-dimensional rendition of Selfridge and Neisser's (1960)
Pandemonium model of pattern recognition: at some neural level subsequent to
the detectors themselves, a decision demon will decide which feature demon, that
for /b/ or /d/, has yelled the loudest (which neuron has fired the most rapidly)
and deserves to be recognized and identified over the Pandemonium of screams
(neural firings) of all the other demons (feature analyzers). This ultimate
decision determines the psychological identity of the stimulus.

During extensive adaptation to the same stimulus, repeated over and over,
the particular feature analyzer may fatigue. The precise nature of the
fatiguing process is not known, but one possibility is shown in the bottom panel
of Figure 9. After adaptation with /da/, the /d/ analyzer may become less and less
sensitive to stimulation by the /d/ prototypes and to all similar stimuli that
would normally trigger it. The decrease in sensitivity may manifest itself in
several ways. First the height of the sensitivity curve may decrease, then the
shoulders of the function would slump inward. The effect would be that the new
crossover point of the two feature analyzers would be moved slightly toward /d/,
away from the old boundary. The new crossover would mark the locus of the
postadaptation boundary between /b/ and /d/. A refractory period of a
considerable amount of time would be necessary to restore potency to the /d/
analyzer. Once restored, however, it would resume the sensitivity function shown
in the top panel and consequently the old phoneme boundary would be restored as
well.
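
The two-detector scheme can be made concrete with a toy model. The sketch below
is one possible reading, not the model drawn in Figure 9: it assumes Gaussian
sensitivity curves centered on Stimulus 1 and Stimulus 7, an arbitrary width of
2 stimulus steps, and fatigue expressed as a halving of the /d/ curve's height.
All names and parameter values are hypothetical.

```python
from math import exp


def response(stimulus, center, height=1.0, width=2.0):
    """Gaussian sensitivity curve of one feature detector (assumed shape)."""
    return height * exp(-((stimulus - center) ** 2) / (2.0 * width ** 2))


def decide(stimulus, d_height=1.0):
    """Decision demon: report whichever detector responds more strongly.
    Exact ties are resolved arbitrarily in favor of /d/."""
    b = response(stimulus, center=1.0)
    d = response(stimulus, center=7.0, height=d_height)
    return "/b/" if b > d else "/d/"


def boundary(d_height=1.0, lo=1.0, hi=7.0, tol=1e-9):
    """Locate the crossover of the two curves between the prototypes
    by bisection on the difference of the detector responses."""
    def diff(x):
        return response(x, 1.0) - response(x, 7.0, height=d_height)

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if diff(mid) > 0:      # /b/ still stronger: crossover lies to the right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


print(boundary(1.0))   # unadapted boundary
print(boundary(0.5))   # boundary after the /d/ detector is fatigued
print(decide(4.0, d_height=0.5))
```

In this toy version, lowering the height of the /d/ curve moves the crossover
toward the /d/ prototype, so the formerly neutral Stimulus 4 now falls on the
/b/ side of the boundary.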

This simple account seems to serve well in explaining shifts in
identification and discrimination functions. Nevertheless, it is not without its
problems. Consider one of the central aspects of this model: during adaptation,
the feature analyzer becomes more and more insensitive to the prototype of the
stimulus category. If this were true, we would expect to find some way to measure
the decrement in sensitivity to the stimulus prototype. The data in
identification functions provide no help. Each function necessarily asymptotes at
0 and at 100 percent by the time it reaches the prototype at Stimulus 1 or
Stimulus 7. Hence we are restricted by floor and ceiling effects, and can make no
inferences about any alleged attenuation in the sensitivity function. In a pilot
study, Michael Posner and I supposed that a reliably more sensitive measure, that
of reaction time, might serve to demonstrate the anticipated effect. Thus we
adapted listeners to one of the endpoint stimuli in this /ba/-to-/da/ array, and
measured their reaction time in responding to (identifying) each member of the
array including the endpoint stimuli. Results seemed very clear. Indeed, there
was a shift in the identification functions, just as we had anticipated, but
there was no significant change in reaction time for the identification of the
stimulus prototype used in adaptation. An increase would have reflected, we
thought, the decreased sensitivity to the category prototype; a decrease in
reaction time might have reflected something else. No change is more difficult to
interpret.

Perhaps, then, the sensitivity function shown in the bottom panel of
Figure 9 is incorrect. Perhaps, rather than decreasing in "height," the
function merely becomes more leptokurtic, that is, just as tall, but much slimmer.
This would have the same effect in shifting the identification and
discrimination functions. Regardless of the final form of the model, it probably
does not differ greatly from the one presented here and should serve to account
for all data. I shall return to this model later and discuss it in some detail
from a different perspective. What must be considered now is the role of these
feature analyzers in language and in music.
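
That a slimmer but equally tall curve also moves the boundary can be verified
with a short calculation. The sketch below assumes Gaussian sensitivity curves
of equal height (a narrower Gaussian standing in for the "slimmer" function);
setting the two curves equal between the prototypes gives a crossover at
(c_b * w_d + c_d * w_b) / (w_b + w_d), where c and w are the assumed centers and
widths. The parameter values are hypothetical.

```python
def crossover_equal_height(center_b, width_b, center_d, width_d):
    """Crossover, between the two prototypes, of two equal-height Gaussian
    sensitivity curves: solves (x - center_b) / width_b
    = (center_d - x) / width_d for x."""
    return (center_b * width_d + center_d * width_b) / (width_b + width_d)


# Unadapted: both detectors equally broad (widths are arbitrary choices).
print(crossover_equal_height(1.0, 2.0, 7.0, 2.0))

# After /da/ adaptation: the /d/ curve is just as tall but half as wide.
print(crossover_equal_height(1.0, 2.0, 7.0, 1.0))
```

With these values the boundary moves from Stimulus 4 to Stimulus 5, that is,
toward /d/, while the response to the /d/ prototype itself is unchanged. Such a
model would therefore not predict a change in sensitivity at the prototype.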

Feature analyzers as linguistic mechanisms and as auditory mechanisms.
Linguists have talked of the binary features of language for a long time, and
Jakobson, Fant, and Halle's book Preliminaries to Speech Analysis (1951) has
been very influential in investigating the acoustic basis for these features.
The notion of features was so well developed in the 1950s and 1960s that this
term for various stimulus aspects of speech may have been borrowed by
neurophysiologists when they discovered neural devices in the visual cortex that
responded solely to edges moving in certain directions. The psychologist
interested in speech and in effects of adaptation might look to linguists
and find one use of the term features, look to neurophysiologists and find
another use, and then yearn to close the gap between the two. Eimas and Corbit
(1973) found this link and described it in their seminal paper. But were these
detectors linguistic in nature, or not? There is a difference between a
linguistic feature detector and a linguistically relevant feature detector. A
linguistic feature detector would be one responding only to speech sounds and not
to similar nonspeech sounds. A linguistically relevant feature detector, on the
other hand, would respond to certain speech sounds but would also respond to
relevant nonlinguistic sounds as well. As yet it is too soon to determine
conclusively whether or not certain of these mechanisms are linguistic or merely
linguistically relevant. Let me first present some evidence that supports the
latter view, then evidence for the former.

A question arose, after the establishment and replication of the
phenomenon, as to how linguistic these mechanisms are. That is to say, while
there may be specific phonetic feature analyzers, are there general phonetic
feature analyzers? The difference between specific and general is crucial.
Specific detectors would be sensitive only to those aspects of a particular
speech sound (/d/, for example) that occur in particular speech contexts (as in
/da/). General detectors, however, would be sensitive to broader aspects of that
speech sound /d/ as it occurs, for example, not only in /da/ but also in /di/,
/du/, and /ad/ (as in deep, dupe, and odd). In other words, will the effect, as
measured by shifts in identification functions pre- and postadaptation, transfer
across different vowel environments? Will it transfer across different syllable
positions? The answers are yes and no, respectively. Ades (1974) found that
adaptation with /de/ (as in date) shifted the identification function of a
/bae/-/dae/ continuum (as in bad, dad), but less readily than did adaptation with
the endpoint /dae/ stimulus. He also found that adaptation with /dae/ had no
effect on the identification of an /aeb/-/aed/ continuum. From such results the
adaptation effect seems phonetic enough to transfer across small acoustic
differences such as those seen when changing between similar vowel environments,
but not phonetic enough to transfer across widely discrepant acoustic forms such
as those found when shifting a phoneme from initial to final syllable position.
Restated, the effect is general enough to transfer across small acoustic
differences, but not general enough to transfer across large differences.

The fact that acoustic differences matter at all is an important issue;
the postadaptation function for /bae/-/dae/ is shifted considerably by
adaptation with /dae/, less by adaptation with /de/, and even less (that is, not
at all) by adaptation with /aed/. Rather than account for this array of results by
phonetic feature adaptation, one might account for the results in terms of
acoustic or auditory feature adaptation. (I will use the term auditory so as to
be more general than the term acoustic might allow. I will want to consider not
only those acoustic features that are analyzed logically prior to the labeling
of speech sounds but also entertain the possibility of a higher-level feature
analysis of an auditory signal, which would appear to occur beyond the
registration of the raw acoustic signal but would not necessarily involve
language. Auditory seems the best term here, implying nonlinguistic as well as
postsensory.) Before asserting that auditory feature analyzers are possible
alternatives to phonetic feature analyzers, one must demonstrate that the notion
of auditory analyzers is viable. Here, we have only to look to the pluck and bow
sounds again.

We (Cutting, Rosner, and Foard, 1975) selected a 440-Hz (concert A) saw-
tooth wave continuum to use for postadaptation identification. Items differed
in rise time in 10-msec steps from 0 to 80 msec. Rather than just two, there
were eight adaptation conditions: adaptation within the same continuum of
sounds using the 0- and 80-msec rise time 440-Hz sawtooth items, which we take
to be our "prototype" pluck and bow stimuli, plus six other conditions. There
was adaptation across different frequencies using 0- and 80-msec sawtooth items
at 294 Hz, adaptation across different waveforms using 0- and 80-msec sine wave
items at 440 Hz, and adaptation across both frequency and waveform using 0- and
80-msec sine wave items at 294 Hz. The same very diligent listeners served in
all conditions but on eight separate days, one per condition. A preadaptation
identification test was given before adaptation tests, and comparisons were al-
ways made between pre- and postadaptation identification functions for a partic-
ular day.

Results were very clear in support of auditory analyzers of pluck and bow. 
Adaptation within the 440-Hz sawtooth wave continuum was considerable. The nor- 
mal boundary of about 40-msec rise time shifted to about 37-msec rise time in 
the pluck-adaptation condition, and to about 50-msec rise time in the bow-adap-
tation condition. Both shifts were highly significant and the difference in
their size is common in adaptation findings. Such differences are often diffi-
cult to interpret, but here we think that they relate to inherent limitations in
a continuum such as rise time. For example, a stimulus can be no more abrupt 
than 0-msec rise time, but can be infinitely less abrupt than 80 msec. 

Significant postadaptation boundary shifts occurred in nearly all other
conditions as well, but their magnitudes tended to be smaller. For example,
pluck-adaptation shifts across one dimension (frequency or rise time) averaged
less than 3 msec, and complementary bow-adaptation shifts averaged only 5 msec.
Adaptation boundary shifts across both dimensions (frequency and rise time) were
smaller still: only 2 msec for both pluck and bow. These results are regular
enough to allow a simple interpretation: the more stimulus dimensions held in
common between adapting and test stimuli the larger is the effect. Thus these
results for pluck and bow items are very similar to the array of results for
speech items.

Let us suppose that the somewhat smaller adaptation effect of /de/ than
/dae/ on a /bae/-/dae/ test continuum is due to the fact that fewer dimensions
are shared between /de/ and the test items. These dimensions could be linguis-
tic (the vowels differ between adapting and test stimuli) or they could be audi-
tory (the spectra differ as well). That there is no adaptation effect of /aed/
on /bae/-/dae/ is difficult to account for in linguistic terms without relying
on some allophonic or syllable level of processing; but the result is relatively
easy to account for in auditory terms because the stimuli are very different.
Ignoring the common vowel nuclei, the comparable transitions in all formants go
the wrong way, that is, in opposite directions. For /dae/ the first formant
ascends, leading into the following vowel, whereas for /aed/ it descends,
following it. The reverse is true for the second formants. Perhaps, then, all
shifts due to adaptation are actually auditory in nature rather than linguistic:
perhaps they are due to adaptation of linguistically relevant features, not lin-
guistic features.

How, then, does adaptation actually work? Perhaps it is not the feature
analyzer itself that is fatigued. Instead, it may be the neural pathway leading
to the analyzer that suffers fatigue. The more the adapting stimulus differs
from the test-item array, the fewer may be the number of intervening processing
stages (from the registration of the acoustic signal to the binary feature anal-
yzers) that are shared between adapting and test stimuli. Thus seen, fatigue
during adaptation might be an inhibitory process that builds up throughout the
many neurons and synapses of a particular pathway for a particular signal. If
this notion is correct, then when the test array differs from the adapting stim-
ulus, each item in that array will travel a somewhat different neural path from
that of the adapting stimulus. The sections of the path for the test stimuli
that are not held in common with the adapting item will not be fatigued, while
those portions held in common will be fatigued. Roughly speaking, if only half
of the pathway neurons are held in common, perhaps the adaptation effect might
be only half as great. It should be obvious that the use of the term pathway
here is at least partly metaphorical, but I do not mean it to be exclusively so
(see Posner, 1975).
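
The "half the pathway, half the effect" idea can be put as a small numerical sketch. The block below is only an illustration, not a fitted model: it assumes that the boundary shift scales linearly with the fraction of the pathway shared by adapting and test stimuli, and the function names and the notion of a single "shared fraction" are hypothetical conveniences. The shift values are those reported above for bow adaptation (about 10 msec within the continuum, 5 msec across one dimension, 2 msec across two).

```python
def predicted_shift(full_shift_msec, shared_fraction):
    """Boundary shift expected if fatigue scales linearly with pathway overlap.

    full_shift_msec: shift when adapting and test stimuli share the whole pathway.
    shared_fraction: assumed proportion of the pathway held in common (0 to 1).
    """
    if not 0.0 <= shared_fraction <= 1.0:
        raise ValueError("shared_fraction must lie between 0 and 1")
    return full_shift_msec * shared_fraction


def implied_shared_fraction(observed_shift_msec, full_shift_msec):
    """Overlap that the linear assumption would require to yield an observed shift."""
    if full_shift_msec <= 0:
        raise ValueError("full_shift_msec must be positive")
    return observed_shift_msec / full_shift_msec


# Bow adaptation within the continuum moved the boundary from about 40 to about 50 msec.
full_bow_shift = 50 - 40

# Reported bow-adaptation shifts when one or two dimensions differed.
for label, observed in (("one dimension differs", 5), ("two dimensions differ", 2)):
    fraction = implied_shared_fraction(observed, full_bow_shift)
    print(f"{label}: observed {observed} msec, implied overlap {fraction:.2f}")
```

Under this linear assumption a 5-msec shift corresponds to half the pathway being shared, which is the rough case described in the text; whether real pathways behave this simply is an open question.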

If the pathway-fatigue account of selective adaptation is viable, the sen-
sitivity functions shown in Figure 9 and the decrease shown in one function in
the bottom panel of that figure may not mark the sensitivity of the actual an-
alyzers themselves. Instead, they may mark, for a given stimulus, the rate of
firing of the sequence of particular neurons in the pathway that leads to the
analyzer: adaptation leads to pathway fatigue and less neural activity. The
end result is exactly the same. Rather than plot sensitivity on the ordinate of
that function, one might substitute the term "neural activity." Boundary shifts
would occur in the same manner.

What about the notion of linguistic feature analyzers as opposed to lin-
guistically relevant analyzers? The answer can come only after exploring farther
the notion of pathways. It seems to make no sense to think of a linguistic
pathway; that is, a neural route that is traveled exclusively by speech stimuli.
Such a pathway would necessarily have to have an early gating device that has
already decided that a particular stimulus is speech. If the speech/nonspeech
decision were already made at this early level, then subsequent analyses would
appear to be unnecessary: to determine whether or not an item is speech, the
system would surely have analyzed the signal for speechlike features. Instead,
then, the pathways leading to the binary analyzers are auditory; that is, gen-
eral enough to handle both linguistic and nonlinguistic events. Since seemingly
all the variance in boundary shifts across different types of adaptation situa-
tions might be accounted for by pathways, the existence of linguistic analyzers
would be unimpugned. It is clear that there are auditory analyzers of similar
nature, at least for pluck and for bow, but they may exist side-by-side, as it
were, with linguistic analyzers. The only logical argument I can offer against
this possibility is an appeal to parsimony: Why have two kinds of binary fea-
ture analyzers, linguistic and nonlinguistic, when one set of nonlinguistic an-
alyzers might do? It may be that the assignment of particular speech labels,
such as /b/ and /d/, occurs subsequent to their categorization. Unfortunately,
such speculation takes us uncomfortably far from the available data.

As an internal summary then, postadaptation boundary shifts in identifica-
tion functions of certain speech and musiclike stimuli seem to be explicable in
terms of neural fatigue. Exactly which pathways fatigue remains an important
question. I have suggested that the fatigue takes place in the neural pathways
prior to arrival at the feature analyzers. Such an account would easily allow
for differential magnitudes of boundary shifting according to the number of
differences between adapting and test stimuli. The analyzers themselves lie
considerably beyond the point at which the two ears converge, and they may not
suffer from adaptation "fatigue." They appear to be binary, at least with re-
gard to any one stimulus, and they appear to function according to signal-detec-
tion criteria and a simplified Pandemonium model. It seems clear that they can
be either linguistic or nonlinguistic in nature. Most may be solely linguistic
and the few others may be linguistically relevant. Remember that rise time cues
not only the difference between pluck and bow, but also the difference between
/ʃa/ and /tʃa/. At present, we have found only one set of analyzers that over-
laps the domains of speech and music. It would be of considerable interest if
others can be found that perform this apparent dual function. It would also be
of interest to find possible binary music analyzers that are not relevant to
speech.

F. Summarizing Remarks 

What of the magical number two? Part of the answer is the same as for the
magical number seven. Miller (1956) noted seven wonders of the world, seven
points on a psychological rating scale, seven seas, seven categories of absolute
judgment, seven deadly sins, seven objects in the span of attention, seven days
of the week, and seven digits in the span of immediate memory. He suspected
that all these sevens were merely "pernicious, Pythagorean coincidence." Such
coincidence is found even more easily in twos: the two faces of Janus; the two
types of learning, operant and respondent; the two cosmic forces, Yin and Yang;
the two minds, conscious and unconscious; the two sexes, male and female; the
two searches through memory, self-terminating and exhaustive; the two diurnal
segments, day and night; the two locales for research, laboratory and field;
and many other twos, such as up and down, self and others, and indeed /ba/ and
/da/, and pluck and bow. Jakobson, Fant, and Halle suggested that the dichotom-
ous scale is the pivotal principle of linguistic structure; a quick glance at
twoness elsewhere suggests that it may be the pivotal principle by which we
parse the world (see Ogden, 1932).

Beyond any "perniciousness" of the magical number two, humans appear to be
predisposed to perceive certain auditory events in a dichotomous manner. I
first discussed such discrete perception in terms of identification functions
for an acoustic continuum of speech sounds from /ba/ to /da/ generated by a
computer-driven speech synthesizer. This first expression of the magical number
two proves not to be crucial. Lines slanted at various angles yield equally
quantal identification functions. The crux of the magical number two is re-
vealed in the discrimination functions, where for the stop consonants one can
discriminate only as well as one can identify, but for the slanted lines one can
discriminate almost infinitely better. A continuum of vowel sounds from /i/ to
/ɪ/ yields intermediate results, and these results appear to be interpretable in
terms of a small perceptual deviation from the number two. The peculiar non-
linearity found in the discrimination of stop-consonant sounds is not unique to
speech items. In fact, evidence for the dichotomous perception of musiclike
sounds is just as striking as that for speech.

Our perceptual predisposition toward the magical number two appears to stem
from biological endowment. Infants as young as one and two months parse certain
speech and music sounds in a manner functionally identical to that of adults.
Such results indicate that these categories and boundaries are natural according
to the most stringent criterion; indeed, they appear to be innate.

The neural mechanisms underlying our perception by twos can be thought of
as yoked pairs of feature analyzers lying well beyond the cochlea. Some may be
unique to speech analysis; others, like those for pluck and bow musical sounds,
may be used in both speech and nonspeech analysis. Continued presentation of a
particular stimulus prototype appears to fatigue selectively the feature-analysis
system, and temporary shifts in the locus of category boundaries are obtained.
These mechanisms appear to allow for high-speed speech perception, the rapid
categorization of a particular speech sound, and the discarding of its nonproto-
typic vagaries; and they also allow for great savings, since the speech sounds
are coded into tight bundles of phonetic features suitable for memory and stor-
age.

In music, however, the role of feature analyzers is less clear. Pluck and
bow categories obviously discriminate among modes of playing certain musical in-
struments, and may relate to the reason these various stringed instruments were
invented, but beyond this minor role and in the absence of other binary musical
features (which may yet be discovered), we simply do not know why they exist.
It may be that they were auditory precursors to phonetic feature detectors, as
if nature were experimenting with the feasibility of such devices. It may be
that they are related to orienting and startle mechanisms: sounds with rapid
onsets often forebode danger, whereas sounds with more gradual onsets are more
likely to be associated with "safer" events. Beyond these speculations it is too
early to say what their purpose may be. Indeed, as Miller suggests, the mere
existence of pluck and bow categories may be a "pernicious, Pythagorean coinci-
dence," but it is preferable to think that they will eventually be tied to a
theoretical fabric relating the structures of speech and music.

Processing Two Dimensions of Nonspeech Stimuli: The Auditory-Phonetic
Distinction Reconsidered*

Mark J. Blechner, Ruth S. Day, and James E. Cutting



ABSTRACT

Nonspeech stimuli were varied along two dimensions: intensity
and rise time. In a series of speeded classification tasks, subjects
were asked to identify the stimuli in terms of one of these dimen-
sions. Identification time for the dimension of rise time increased
when there was irrelevant variation in intensity; however, identifi-
cation of intensity was unaffected by irrelevant variation in rise
time. When the two dimensions varied redundantly, identification
time decreased. This pattern of results is virtually identical to
that obtained previously for stimuli that vary along a linguistic and
a nonlinguistic dimension. The present data, taken together with
those of other studies using the same stimuli, suggest that the mech-
anisms underlying the auditory-phonetic distinction should be recon-
sidered. The results are also discussed in terms of general models
of multidimensional information processing.

Several contemporary accounts of speech perception have emphasized the or-
ganization of processing into a hierarchy of levels, including auditory, phonet-
ic, phonological, lexical, syntactic, and semantic (Fry, 1956; Stevens and
House, 1972; Studdert-Kennedy, in press). The distinction between phonetic and
higher levels has been commonly accepted by linguists and psychologists for some
time. Recently, however, much attention has been directed toward the auditory-
phonetic distinction (e.g., Fant, 1967; Stevens and Halle, 1967; Studdert-
Kennedy, Shankweiler, and Pisoni, 1972). Fry (1956), in an early discussion of
the levels-of-processing view of speech perception, emphasized the role of the
"physical-psychological transformation" that occurs in the recognition of pho-
nemes from the acoustical signal. The important characteristic of this trans-
formation is that there is no one-to-one relationship between "the number and
arrangement of physical clues and the sound which is recognized" (p. 170).
Fry did not state that the physical-psychological transformations characteristic
of speech are exclusive to speech. However, this possibility was emphasized by
later work that viewed speech perception as mediated by articulatory mechanisms
(Liberman, Cooper, Shankweiler, and Studdert-Kennedy, 1967; Stevens and House,
1972). Such a view made it desirable and perhaps necessary to partition all
sounds into two general classes: those that are speech and those that are not.¹

Definitions concerning which criteria must be met in order for sounds to be 
classified as speech have varied in the literature. The present paper assumes a 
two-part definition of speech: sounds that (1) can be articulated by the human 
vocal apparatus; and (2) can be recoded into higher-order linguistic units. 
According to this definition, phonetic processing may make reference to articu-
latory processes either directly (Liberman et al., 1967) or implicitly (Stevens
and House, 1972), and follows some system of linguistic organization such as a 
distinctive feature system (Jakobson, Fant, and Halle, 1963; Chomsky and Halle, 
1968). Furthermore, while all sounds undergo auditory processing, only speech 
sounds undergo phonetic processing. 

To determine the empirical validity of the auditory-phonetic distinction,
many experiments have been conducted. Results from several paradigms suggest
that even though speech sounds differ along a wide variety of acoustic dimen-
sions, they are perceived in ways that are qualitatively distinct from the way
nonspeech sounds are perceived. For example, the main difference between the
phonemes /ba/ and /da/ is the direction and extent of the second-formant transi-
tion (Liberman, Delattre, Cooper, and Gerstman, 1954), while the distinction be-
tween /ba/ and /pa/ lies in the latency between the initial plosive burst and
the onset of voicing (Lisker and Abramson, 1964). Yet, for both distinctions,
there is no one-to-one relationship between changes in the acoustic patterns and
probabilities of identification. Instead, item identifications remain at or
near 100 percent as one phoneme or the other, with an abrupt crossover at the
"phoneme boundary." More importantly, whereas most stimulus dimensions in the
environment, such as pitch, intensity, and brightness, can be discriminated from
one another much more accurately than they can be identified (Miller, 1956), this
is not the case for several linguistic dimensions. Instead, two acoustically dif-
ferent stimuli that lie within the same phoneme category are discriminated at
near-chance level; two stimuli that lie in separate categories but differ by the
same acoustic increment are discriminated with very few errors. This nonlinear
mode of perception is called "categorical perception" (Liberman, Harris, Hoffman,
and Griffith, 1957; for a review, see Studdert-Kennedy, in press).

Other experimental operations have shown processing differences for speech
and nonspeech stimuli. Recent work using a selective adaptation paradigm has
shown that repeated presentation of a consonant-vowel (CV) syllable produces
systematic shifts in the phoneme boundary (Eimas, Cooper, and Corbit, 1973;
Eimas and Corbit, 1973). Some experiments suggest that the basis of this adap-
tation is phonetic rather than auditory (Cooper, 1975), although this evidence
is not conclusive (Studdert-Kennedy, in press). The auditory-phonetic distinc-
tion seems to be further supported by dichotic identification tasks that often
reveal right-ear advantages for speech stimuli (e.g., Kimura, 1961; Shankweiler
and Studdert-Kennedy, 1967) and left-ear advantages for nonspeech stimuli (e.g.,
Kimura, 1964; Chaney and Webster, 1966; Curry, 1967).



¹Some authors use the terms "linguistic and nonlinguistic" or "verbal and non-
verbal" to indicate the same distinction.

In addition, several experiments have investigated the relationship of
auditory and phonetic processes in selective attention tasks using stimuli that
vary along two dimensions. When both dimensions are linguistic (e.g., initial
stop consonant and vowel in CV syllables), selective attention for either dimen-
sion is impaired by irrelevant variation in the other dimension (Wood and Day,
1975). This pattern of symmetric interference also occurs when both dimensions
are nonlinguistic, such as pitch and intensity (Wood, 1975). However, when one
dimension is linguistic and the other is not, a pattern of asymmetric interfer-
ence appears; reaction time (RT) for identification of stop consonants is im-
paired by irrelevant variation in pitch, but RT for pitch identification is not
increased significantly by irrelevant variation in stop consonant (Day and Wood,
1972; Wood, 1974, 1975). These results have been interpreted to support the
auditory-phonetic distinction.

The dichotic listening, categorical perception, selective adaptation, and
speeded classification experiments appear to comprise a set of converging opera-
tions (Garner, Hake, and Eriksen, 1956) on the psychological reality of the
auditory-phonetic distinction. Recently, however, this view has been brought
into question by experiments in which the stimuli are sawtooth-wave tones dif-
fering in rise time (see Cutting, in press). While rise time can cue a speech
distinction, such as the difference between /ʃa/ and /tʃa/, sawtooth waves
are not perceived as speech. Instead, they sound comparable to a plucked or
bowed violin string. Although these "plucks" and "bows" are not processed pho-
netically, they are processed similarly to speech in several ways. They are per-
ceived categorically (Cutting and Rosner, 1974), and their identification func-
tions shift in the same manner as for speech following selective adaptation
(Cutting, Rosner, and Foard, 1975). Ear advantage data for plucks and bows are
not yet decisive.

The present experiment seeks to investigate further the processes by which 
the plucks and bows are perceived and their relationship to the auditory-phonet- 
ic distinction. The two-choice speeded classification procedure developed by 
Garner and Felfoldy (1970) and modified for use in auditory experiments by Day 
and Wood (1972; Wood, 1974, 1975; Wood and Day, 1975) was used to determine how 
the dimensions of rise time and intensity interact. If symmetric interference 
necessarily results when stimulus dimensions are of the same general class
(i.e., both linguistic or both nonlinguistic), then selective attention to either rise
time or intensity should suffer from irrelevant variation in the other dimen- 
sion. However, if a pattern of asymmetric interference occurs, it would be 
clear that such a pattern need not be based on separate auditory and phonetic 
levels of processing. This pattern of results, along with those of other 
studies using pluck and bow stimuli, would lead one to question the mechanisms 
underlying the auditory-phonetic distinction as currently conceived. 

The present experiment is also concerned with current conceptions about the
processing of multidimensional stimuli in general. The pattern of asymmetric
interference suggests a serial model; only the dimension processed first inter-
feres with the other. However, this view has been challenged by the finding
that when the two dimensions of the same stimulus vary redundantly, subjects can
identify them more quickly than when only one dimension varies (Wood, 1974).
This redundancy gain (Garner and Felfoldy, 1970) seems to argue against a serial
model. If pitch processing were completed before processing of the stop conso-
nant began, a subject should not be able to use redundant information about the
stop consonant to speed up identification of pitch. Instead, a parallel model,
which posits that processing of the two dimensions overlaps or is simultaneous,
would better account for the redundancy gain. Information-processing models
that propose strict serial or parallel processing do not seem to offer an ade-
quate explanation of Wood's (1974) finding of asymmetric interference with re-
dundancy gain. It seems likely, instead, that subjects have some degree of
freedom about the kinds of processing that they use in different task conditions.
The present experiment, by also including a task that varies the two dimensions
redundantly, seeks to distinguish the conditions in which processing strategies
are optional from those in which they are mandatory.

METHOD 

Stimuli

Stimuli varied along two dimensions: intensity and rise time. They were
derived from the sawtooth waves used by Cutting and Rosner (1974), generated on
the Moog synthesizer at the Presser Electronic Studio of the University of
Pennsylvania. The original stimuli differed in rise time and resembled the
sound of a plucked or bowed violin string. The pluck and bow stimuli reached
maximum intensity in 10 and 80 msec, respectively. Two more stimuli of lower
amplitude were created by attenuating the original stimuli 7 dB, using the pulse
code modulation (PCM) system at the Haskins Laboratories (Cooper and Mattingly,
1969). Thus, the final four stimuli were loud-pluck, soft-pluck, loud-bow, and
soft-bow. The absolute levels of the loud and soft stimuli were 75 and 68 dB re
20 μN/m², respectively. All stimuli were truncated to 800 msec in duration (the
original stimuli were approximately 1050 msec), and then digitized and stored on
disc file using the PCM system. Items were reconverted to analog form at the
time of tape recording.

Tapes

All tapes were prepared on the PCM system. Test stimuli were recorded on
one channel of the audio tape. On the other channel, brief pulses were synchron-
ized with the onset of each test stimulus; these pulses triggered the RT counter
during the experimental session. 

A display tape was prepared to introduce the listeners to the stimuli. The 
four stimuli were played in the same order several times, beginning with three 
tokens of each item, then two of each, and finally one of each. Practice tapes
were also prepared, consisting of a randomized order of 20 items, five of each
stimulus. There were two practice tapes, each with a different random order.

The eight test tapes each contained 64 stimuli with a 2-sec interstimulus 
interval. Each tape was composed of different subsets of the four stimuli, de- 
pending on the condition of the experiment. In the control condition, the stim- 
uli varied along only one dimension, while the other dimension was held constant. 
Thus, for example, one intensity control tape consisted of loud and soft bows 
only, while the other consisted of loud and soft plucks. For half the subjects, 
the nontarget dimension (in this case, rise time) was held constant at one value 
(pluck), whereas for the other half, it was held constant at the other (bow). 
Rise-time control tapes were constructed in an analogous fashion. In the ortho- 
gonal condition, both dimensions varied independently. Hence, the two tapes for 
this condition contained all four kinds of stimuli, in different random orders.
In the correlated condition, the dimensions varied in a completely redundant
manner. In this experiment all of the pluck stimuli in this condition were loud
and all of the bow stimuli were soft. See Table 1 for a complete outline of the
stimuli in each condition.



TABLE 1: Stimulus sets for each target dimension and condition.

                                        Condition

Target Dimension   Control                   Correlated   Orthogonal

Rise Time          Loud-pluck, loud-bow      Loud-pluck   Loud-pluck
                     or                      Soft-bow     Soft-pluck
                   Soft-pluck, soft-bow                   Loud-bow
                                                          Soft-bow

Intensity          Loud-pluck, soft-pluck    Loud-pluck   Loud-pluck
                     or                      Soft-bow     Soft-pluck
                   Loud-bow, soft-bow                     Loud-bow
                                                          Soft-bow
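
The construction of the stimulus sets in Table 1 can be sketched in a few lines of code. This is only an illustration of the design logic described above; the function name, the string labels, and the argument names are hypothetical and are not part of the original procedure.

```python
from itertools import product

INTENSITIES = ("loud", "soft")
RISE_TIMES = ("pluck", "bow")


def stimulus_set(condition, target, held_constant=None):
    """Return the stimuli on a tape for one condition and target dimension.

    condition: "control", "correlated", or "orthogonal".
    target: "rise_time" or "intensity" (the dimension to be identified).
    held_constant: for control tapes only, the fixed value of the other dimension.
    """
    if target not in ("rise_time", "intensity"):
        raise ValueError("unknown target dimension")
    if condition == "orthogonal":
        # Both dimensions vary independently: all four stimuli appear.
        return [f"{i}-{r}" for i, r in product(INTENSITIES, RISE_TIMES)]
    if condition == "correlated":
        # Dimensions vary redundantly: plucks are always loud, bows always soft.
        return ["loud-pluck", "soft-bow"]
    if condition == "control":
        if target == "rise_time":
            if held_constant not in INTENSITIES:
                raise ValueError("hold intensity at 'loud' or 'soft'")
            return [f"{held_constant}-{r}" for r in RISE_TIMES]
        if held_constant not in RISE_TIMES:
            raise ValueError("hold rise time at 'pluck' or 'bow'")
        return [f"{i}-{held_constant}" for i in INTENSITIES]
    raise ValueError("unknown condition")


print(stimulus_set("control", "intensity", held_constant="bow"))
print(stimulus_set("correlated", "rise_time"))
print(stimulus_set("orthogonal", "rise_time"))
```

Note that the correlated and orthogonal sets are the same for both target dimensions; only the instructions to the subject differ.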



Subjects and Apparatus 


The six subjects (five males and one female, from 19 to 27 years of age) 
participated in all six tasks. All reported no history of hearing trouble. 

The tapes were played on an Ampex AG-500 tape recorder and the stimuli were
presented through calibrated Telephonics headphones (Model TDH39-300Z). Sub-
jects sat in a sound-insulated room and responded with their dominant hand on
the two telegraph keys mounted on a wooden board. Throughout the experiment,
the left key was used for pluck and loud responses, while the right key was used
for bow and soft responses. The pulse on one channel of the tape triggered a
Hewlett-Packard 522B electronic counter. When a response on either telegraph
key stopped the counter, the reaction time was registered onto paper tape by
a Hewlett-Packard 560A digital recorder for subsequent analysis. The listener's
response choice was recorded manually by the experimenter.

Procedure

At the start of the session, subjects were informed of the general nature
of the experiment and of the dimensions they would be asked to identify. They 
were told that the difference in rise time could be compared to the difference 
in sound between a plucked and a bowed violin string. 

For preliminary training, subjects heard the display sequence twice. Next,
they listened to the two sets of 20 practice items and responded verbally, first
attending to rise time and then to intensity. This ensured that the subject
could perceive the differences along both dimensions. They then repeated the
same practice trials, responding with a key-press rather than verbally, in order
to become familiar with the mode of response. Subjects were instructed to re-
spond as quickly as possible without making errors.

The order of presentation for the six test tapes was determined by a bal-
anced Latin Square design. Before the test trials of each condition, appropri-
ate instructions were read. The subject was then given eight practice trials to
help stabilize RT performance and to familiarize him with the identification
task and stimulus set for that particular condition. In the control conditions
the subject was told which dimension to attend to and the value at which the
other dimension would be held. In the orthogonal conditions, he was instructed
to attend to one dimension and to ignore variation in the other dimension. In
the correlated conditions, he was instructed to attend to one dimension, but was
encouraged to use the additional information from the other dimension.

RESULTS 

Both dimensions were easy to identify. In the practice trials, no subject 
made more than 2 percent errors. During test trials, the highest mean error 
rate for any condition was 1.8 percent. A three-way analysis of variance of the 
error data (subjects x conditions x dimensions) revealed no significant main
effects or interactions. Therefore, the error data will not be considered in
any detail in this discussion.

For the reaction time data, median RT was calculated for each individual
block of trials for each subject, and means of medians for each condition across
subjects were computed.² In addition, the untransformed RT data were subjected
to a complete four-way factorial analysis of variance (subjects x conditions x
dimensions x within cell).
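
The means-of-medians summary can be illustrated with a short sketch. The reaction times below are invented solely to show the arithmetic and are not data from this experiment; the function name is likewise hypothetical. The first block contains one very long RT, of the sort that might arise from a lapse of attention, to show how little it moves the median.

```python
from statistics import mean, median


def mean_of_medians(blocks_by_subject):
    """Average, across subjects, of each subject's median RT for one condition.

    blocks_by_subject: one list of reaction times (msec) per subject.
    """
    return mean(median(block) for block in blocks_by_subject)


# Invented reaction times (msec) for three hypothetical subjects.
blocks = [
    [410, 430, 425, 1900],  # one very long RT in this block
    [450, 440, 460, 455],
    [400, 420, 415, 405],
]

print("mean of medians:", mean_of_medians(blocks))
print("mean of all raw RTs:", mean(rt for block in blocks for rt in block))
```

With these invented values the mean of medians is 430 msec, whereas the mean of the raw values is pulled upward by the single long RT.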

Median RT data are presented in Table 2. For the dimension of rise time, 
there was an increase of 53.5 msec from the control to the orthogonal condition, 
while for the dimension of intensity, there was an increase of only .2 msec. In 
the analysis of variance, the effect of dimensions (intensity versus rise time) 
was not significant, while the effect of conditions (control, orthogonal, and 
correlated) was significant [F (2, 126) = 33.23, p < .001]. In addition, the 
dimensions x conditions interaction was significant [F (2, 126) = 6.43, p < .01]. 
In order to differentiate interference effects from redundancy gains, a contrast 
of the interactions between the two dimensions and only the control and 
orthogonal conditions was performed, omitting correlated conditions. This 
contrast was significant [F (1, 63) = 16.58, p < .001]. Thus there was an 
asymmetric pattern of interference: intensity variation interfered with the 
processing of rise time, while rise time had virtually no effect on the 
processing of intensity. This finding is especially interesting given that 
intensity was somewhat more discriminable than rise time, although the 
difference of 14.2 msec between the control conditions was not statistically 
reliable. 

Wood and Day (1975), in all of their reaction-time experiments, transformed 
their data, so that any RT longer than 1 sec was set equal to 1 sec. This was 
done to correct for possible malfunctioning of the equipment, such as failure 
of the response key to make electrical contact, or temporary inattention of the 
subject. While we agree that unusually long reaction times due to equipment 
trouble or lapsing of the subject's attention should be transformed, the 
continual resetting of long RT values to 1 sec seems arbitrary and could distort 
the data. If arithmetic means are to be used, very long data points can be 
more equitably adjusted by using reciprocal RT values. Alternatively, medians 
or trimeans can be used, with similar effect. 
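
The note on long reaction times can be made concrete with a small sketch. The RT values below are invented for illustration, and the three summaries (resetting long values to 1 sec, a mean taken on reciprocal RTs, and the median) are only one reading of the alternatives named in that note, not the analysis actually run on these data.

```python
from statistics import mean, median

# Hypothetical RTs (msec) for one block; the last trial is an unusually
# long response of the kind the note discusses.
rts = [400.0, 420.0, 440.0, 460.0, 2500.0]

# Plain arithmetic mean: pulled strongly upward by the single long trial.
raw_mean = mean(rts)

# Resetting every RT longer than 1 sec to exactly 1 sec before averaging.
clipped_mean = mean(min(rt, 1000.0) for rt in rts)

# Averaging reciprocal RTs and converting back to msec (a harmonic mean).
reciprocal_mean = 1.0 / mean(1.0 / rt for rt in rts)

# Median of the block, the summary used for the RT data reported here.
block_median = median(rts)

print(raw_mean, clipped_mean, round(reciprocal_mean, 1), block_median)
```

With these made-up values the long trial dominates the raw mean, while the reciprocal-based mean and the median stay close to the bulk of the responses without requiring an arbitrary cutoff.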


TABLE 2: Median reaction time in milliseconds 
         for each dimension and condition. 

                          Condition 

Dimension      Control    Correlated    Orthogonal 

Rise time       426.5       393.7         480.0 

Intensity       442.3       406.4         442.5 
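
The interference and redundancy effects discussed in the text follow directly from the cell medians in Table 2. The sketch below recomputes them; the function names are illustrative, and the orthogonal rise-time entry is taken as 480.0, the value implied by the 53.5-msec increase reported in the text.

```python
# Median RTs (msec) from Table 2. The orthogonal rise-time entry is taken as
# 480.0, the value implied by the 53.5-msec increase reported in the text.
TABLE_2 = {
    "rise time": {"control": 426.5, "correlated": 393.7, "orthogonal": 480.0},
    "intensity": {"control": 442.3, "correlated": 406.4, "orthogonal": 442.5},
}


def interference(cells):
    """Orthogonal minus control: slowing caused by irrelevant variation."""
    return round(cells["orthogonal"] - cells["control"], 1)


def redundancy_gain(cells):
    """Control minus correlated: speed-up when the dimensions are redundant."""
    return round(cells["control"] - cells["correlated"], 1)


summary = {dim: (interference(c), redundancy_gain(c)) for dim, c in TABLE_2.items()}

for dim, (cost, gain) in summary.items():
    print(f"{dim}: interference {cost} msec, redundancy gain {gain} msec")
```

The asymmetry is visible at once: a large interference cost for rise time, essentially none for intensity, and redundancy gains of similar size on both dimensions.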

The effect of redundancy gains was assessed by two methods. A contrast of 
the conditions effect showed the correlated conditions to be significantly dif- 
ferent from the control conditions [F (1, 63) = 35.1, p < .001]. A subsequent 
comparison of the individual means using the Newman-Keuls procedure showed the 
correlated conditions in both dimensions to differ significantly from the re- 
spective control condition. The different correlated conditions, like the con- 
trol conditions, did not differ significantly from each other. 

In order to determine whether the redundancy gain could rightfully be 
considered as evidence of parallel processing of the two dimensions in this 
experiment, three alternative explanations of the redundancy gain were ruled 
out, as in Wood (1974). First, the possibility of a different speed-accuracy 
trade-off in the two correlated conditions could be eliminated by the lack of 
significant differences in the error data, as noted above. 

Second, the possibility that the redundancy gain could be due to selective 
serial processing (SSP; see Felfoldy and Garner, 1971) was considered. If a 
subject uses the SSP strategy in the correlated condition, he merely attends to 
the more discriminable of the two dimensions for him, regardless of the 
instructions. Thus, his RT data would show that neither correlated condition is 
faster than the faster control condition. To test for the occurrence of the SSP 
strategy, the RT data for each correlated condition were tested against each 
subject's faster control condition with an analysis of variance and a subsequent 
comparison of means using the Newman-Keuls method. The correlated conditions 
were still found to be faster than the faster control condition 
[F (2, 63) = 13.5, p < .001]. 
Therefore, the redundancy gain cannot be attributed to the SSP strategy. 
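
The logic of the SSP check can be sketched as follows. The per-subject values and the names `SUBJECTS` and `gain_over_faster_control` are hypothetical, introduced only to show the comparison; the actual test was an analysis of variance followed by Newman-Keuls comparisons, which this sketch does not reproduce.

```python
# Hypothetical per-subject mean RTs (msec); these are not data from the experiment.
SUBJECTS = {
    "S1": {"control_rise": 430.0, "control_intensity": 445.0,
           "correlated_rise": 395.0, "correlated_intensity": 405.0},
    "S2": {"control_rise": 450.0, "control_intensity": 420.0,
           "correlated_rise": 400.0, "correlated_intensity": 410.0},
}


def gain_over_faster_control(rts):
    """RT advantage of each correlated condition over the subject's own
    faster control condition (positive means the correlated one is faster)."""
    faster_control = min(rts["control_rise"], rts["control_intensity"])
    return {
        cond: faster_control - rts[cond]
        for cond in ("correlated_rise", "correlated_intensity")
    }


gains = {subj: gain_over_faster_control(rts) for subj, rts in SUBJECTS.items()}

# Under SSP no correlated condition should beat the faster control condition,
# so every gain would be zero or negative.
ssp_consistent = all(g <= 0 for per_subj in gains.values() for g in per_subj.values())

print(gains, ssp_consistent)
```

The key design point is that the baseline is chosen separately for each subject, since the more discriminable dimension may differ from one listener to the next.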

Finally, because the stimulus sets in the two correlated conditions were 
the same, whereas in the control conditions they were different, it is possible 
that the redundancy gain could be based on differential transfer between control 
and correlated conditions (Biederman and Checkosky, 1970). To test this 
explanation, an analysis of variance of the control and correlated conditions 
(subjects x conditions x order of presentation x within cell) was performed. 
The control condition presented first was 16 msec slower than the second, 
suggesting a possible practice effect, although this difference was not 
significant. The correlated conditions presented first and second differed by 
only 2 msec, which was not significant. Thus the transfer between the correlated 
conditions was less than or equal to the transfer between the control 
conditions, so that differential transfer does not account for the redundancy 
gain. 

DISCUSSION 



The pattern of results in this experiment is remarkably similar to those of 
Wood (1974). The relationship of intensity to rise time matches that of pitch 
to initial stop consonant both in the asymmetric pattern of interference and in 
the significant redundancy gain. In Garner's (1974) terminology, the dimensions 
of rise time and intensity are therefore asymmetrically integral. In the 
following discussion we will consider first the implications of our results in 
terms of general information-processing models and then reconsider the 
auditory-phonetic distinction. 

The present results pose problems for information-processing models that 
try to account for perception only in terms of the serial or parallel handling 
of information. The data suggest that both stimulus and task characteristics 
may affect the mode of processing, so that neither a strict serial nor a strict 
parallel model can account for the whole picture. This view agrees with the 
recent suggestions of several authors (Nickerson, 1971; Townsend, 1971; Garner, 
1974). An alternative to the strictly serial and parallel models is that the 
two processes overlap temporally and that one is contingent on the other, as 
suggested by Turvey (1973). Perhaps the processing of rise time and intensity 
(as well as of place of articulation and pitch) begins simultaneously. Both 
kinds of information can be combined to produce a redundancy gain in the 
correlated condition. However, it may be that only the orthogonal condition, 
which requires information gating (Posner, 1964), can reveal the contingency of 
one kind of information on the other. Current theories, however, do not account 
for why this contingency relationship might affect one task and not the other. 
Further research is needed on this point. 


One useful approach might be to vary the discriminability of the two 
dimensions and to note the changes that occur in both the orthogonal and 
correlated conditions. For example, Blechner and Cutting performed a speeded 
classification experiment using rise time and pitch, where the latter dimension 
was considerably more discriminable than the former. The result was that the 
subjects always processed the more discriminable dimension first. Thus, in the 
orthogonal condition, RT performance was equal to the faster control condition 
(the SSP pattern). However, it would be more interesting to determine whether, 
by manipulating discriminability in the reverse direction, rise time can be made 
to interfere with intensity. How much more discriminable than intensity would 
rise time have to be for a pattern of mutual interference to appear? If RT 
performance in the correlated condition were only as fast as the faster control 
condition, one might conclude that redundancy gains are impervious to the 
effects of contingency processing relationships between dimensions. Such a 
finding would be congruent with the results of the experiment reported here. 

The present results bear not only on the way that different levels of 
processing interact, but also on the very question of which levels of processing 
are important in human auditory perception. In light of the present data, which 
show asymmetric interference between two "auditory" dimensions, it seems unwise 
to lump together all acoustic properties that do not provide linguistic cues. 
At the very least, a heterogeneous conception of auditory analyzing systems, 
with processing levels of increasing complexity, seems preferable. 



The present data also suggest several basic positions concerning the issue 
of speech and nonspeech processing systems. One view suggests that two 
analogous processing systems exist, one for speech and one for nonspeech. Both 
are subsequent to a preliminary analysis of all acoustic stimuli, and both are 
organized to produce identical results in the speeded classification task. This 
explanation, however, may be challenged on grounds of parsimony: Why conceive 
of two systems when both behave in the same way in several circumstances? 
Cautious theorists, however, may object that analogy is not identity, and that 
the theory of separate speech and nonspeech systems should remain as long as 
perfect congruence between the two systems has not been demonstrated. 

An alternative account of the present results is that perceptual processes 
are not divided according to the status of the stimulus dimensions as speech or 
nonspeech, but rather with respect to the kinds of acoustic analysis that the 
signal requires. Thus, rise time seems to be perceived in the same way, 
regardless of whether it characterizes a speech sound, as in /ʃa/ and /tʃa/, or 
a nonspeech sound, as in plucks and bows (see Cutting and Rosner, 1974, for 
experiments that make this comparison directly). Additional support for this 
view may be found in the work of Miller, Pastore, Wier, Kelly, and Dooling 
(1974), who found that noise-buzz sequences with varying relative onset times 
were also perceived categorically. The stimuli varied in a manner analogous to 
the voice-onset-time continuum; thus there appears to be a comparable mode of 
perception for such stimuli, regardless of whether they are speech or nonspeech. 

It may be possible to consolidate the above two views by finding a common 
conceptual relationship between the pairs "auditory-phonetic" and 
"intensity-rise time." By the definition of the term "phonetic" established 
above, it is clear that plucks and bows are not phonetic, since they cannot be 
articulated by the human vocal tract. However, the status of plucks and bows 
with respect to the second part of the definition, that the sounds can be 
recoded into higher-order linguistic units, is less clear. Certainly, they are 
not part of spoken language, but they do comprise lower level components in the 
"language" of music, which, like human spoken language, can be divided into 
hierarchical levels of organization (e.g., pitch, timbre, and harmony; see 
Nattiez, 1975, for a more complete discussion of this problem). 

It is important to determine the extent to which plucks and bows are 
processed in terms of their specific acoustic characteristics, their status as 
nonspeech, or perhaps the role they play in a hierarchically organized system of 
sound. Various data have been used to support the view that it is the status of 
sounds as speech or nonspeech, rather than specific acoustic characteristics, 
that determines whether certain kinds of processing will occur. For example, 
when variations in acoustic dimensions analogous to those characteristic of 
certain phonemes, such as isolated second formants, are presented in a nonspeech 
context for identification and discrimination, perception is no longer 
categorical (Mattingly, Liberman, Syrdal, and Halwes, 1971). However, such 
sounds are not only nonlinguistic, they also bear little resemblance to sounds 
that commonly occur in the listener's environment. In contrast, the plucks and 
bows used in this experiment, which are not phonetic by the strictest 
definition, do resemble commonly heard musical sounds. It is an intriguing 
possibility that the plucks and bows are perceived in ways similar to speech 
partly because of their "codability" or "meaningfulness" for the listener. 
Perhaps it is the interaction of the acoustic nature of the sound and its 
significance for the listener that leads to the kinds of perceptual phenomena 
that have been considered exclusive to speech. 

The results of experiments using noise-buzz stimuli (Miller et al., 1974) 
appear to argue against the analogy between the perception of basic musical 
units and phonemes. However, unlike plucks and bows, the noise-buzz sequences 
have been shown to be perceived as speech in only one experimental paradigm. 
If, in fact, they did match the speech results in other paradigms, such as 
selective adaptation and speeded classification, one would still want to 
ascertain whether subjects phenomenologically experience the stimuli as 
resembling common environmental sounds that are codable in a hierarchically 
organized system of sound like music and language. 

In conclusion, we do not question that levels of processing separate 
certain linguistic and nonlinguistic dimensions of the same stimuli. We suggest, 
rather, that the crux of the auditory-phonetic distinction is, as Fry (1956) 
suggested, a "physical-psychological transformation." The nature of this 
transformation probably cannot be accounted for solely in terms of the 
linguistic-nonlinguistic distinction. Instead it may be based on acoustic 
properties alone, on the coding of sounds within a hierarchically organized 
system, or on the interaction of acoustic properties with such a system. 


Predicting Initial Cluster Frequencies by Phonemic Difference 
James E. Cutting 



ABSTRACT 

The frequency of occurrence for stop-liquid and stop-semivowel 
clusters can be predicted on the basis of the number of distinctive 
features that separate the member phonemes: the greater the phonemic 
difference, the more frequent the cluster. Predictions made in this 
manner are generally much better than those made from chance 
co-occurrence of successive phonemes. Assessments are made on four 
extensive corpora. Each of six distinctive features is examined 
individually. 

Why does the phonology of a given language permit certain phoneme clusters 
to occur, but not others? Why do some clusters occur frequently, others rarely? 
Saporta (1955), among others, suggested that the answer to both questions may 
lie in an application of Zipf's law (Zipf, 1949): "The relative frequency of 
consonant clusters will reveal a tendency on the part of any language system to 
produce speech in such a way as to consider the effort of both the speaker and 
the listener" (Saporta, 1955:25). Zipf (1935) had applied this principle to 
descriptions of phonemes, but Saporta (1955; Keller and Saporta, 1957) first 
applied it to phoneme clusters. Unique to Saporta's approach was the conjoining 
of a phonetic feature system (Jakobson, Fant, and Halle, 1951) and the principle 
of least effort. He suggested that phoneme cluster frequency correlated with 
the number of distinctive features not shared between the member phonemes of a 
given cluster. Altmann (1969) termed this measure phonemic difference. Saporta 
purposefully excluded clusters with liquids (/l/ and /r/) and semivowels (/w/ 
and /y/) and suggested that these combinations needed further study. The 
present paper is concerned with these clusters and with predicting their 
frequency of occurrence by the principle of phonemic difference. 

Carroll (1958) criticized Saporta's analyses on the grounds that inadequate 
consideration was given to the possible chance nature of the results. After 
performing the proper analyses, Carroll concluded that phonemic difference has 
merit as a measure for predicting cluster frequency, but that a much more 
extensive data base (cluster count) should be used in further research. 
Following this advice, the present investigation uses data available from 
several extensive corpora gathered since the publication of the Saporta and 
Carroll articles. In addition, the principle of phonemic difference is matched 
against a control (or chance-factor) principle, where cluster frequency is 
predicted from the frequency of occurrence of the member phonemes. 

The present research, however, differs in several ways from previous 
studies. First, it is concerned with only a selected number of clusters where 
phonemic difference is not extensive. The first member of all clusters is a 
stop consonant /p,b,t,d,k,g/, and the second member is either a liquid or a 
semivowel /l,r,w,y/. This limitation removes from consideration the more widely 
differing clusters used by Saporta and Carroll, which may follow a different 
principle than those investigated here. Second, this limitation alters the main 
hypothesis. Saporta suggested that the most frequently occurring clusters 
should be those with intermediate phonemic difference. Here, I hope to 
demonstrate that for certain clusters maximal difference in the number of 
distinctive features shared by successive phonemes serves to predict cluster 
frequency. It should be remembered, however, that Saporta's intermediate 
differences and the maximal differences presented in this paper are nearly the 
same: interphoneme dissimilarities along five features. Third, the present 
interpretation of phonemic differences is not linked to the principle of least 
effort. Saporta (1955) noted that for the speaker the situation of least effort 
should be one in which successive phonemes are most similar, but for the 
listener that situation is one in which the phonemes are least similar. This 
led him to predict that intermediate phonemic difference should be important, 
serving both speaker and listener. However, Wang (1959) found that the least 
effort principle did not apply to the perceptions of final clusters, thus 
questioning the role of the listener in this application of Zipf's law. Least 
effort for the speaker may be difficult to assess in an unbiased fashion. 
Therefore, it seems unwise to shackle a principle of phonemic difference to the 
broader principle of least effort; instead, it should stand alone. Fourth, the 
present study is concerned only with initial clusters rather than with both 
initial and final. The major reason for this limitation is simply that there 
are very few liquid-stop and semivowel-stop clusters in English. 

Before presenting evidence supporting a revised phonemic difference 
principle, a few matters concerning methodological approach must be mentioned. 
First, the particular distinctive features system used is that proposed by 
Halle (1964), elaborated from earlier versions (Jakobson, Fant, and Halle, 
1951; Cherry, Halle, and Jakobson, 1953). The subsequent feature system of 
Chomsky and Halle (1968) is rejected on the grounds that many of the additional 
features are redundant for the particular phonemes selected here. The values 
for each stop, liquid, and semivowel are made explicit by Wickelgren (1966) 
using Halle's definitions, and they are shown in Table 1. Second, each feature 
is considered equally important. Such an assumption is dangerous. Carroll 
(1958), for example, suggested that each feature should be considered 
separately, since voicing alone contributed extensively to Saporta's main 
finding. Separate analyses can confirm whether phonemic difference is a general 
principle for phoneme clusters, or whether it is limited only to certain 
contrasts. To demonstrate the generality of this principle, proper analyses 
will be performed. Third, the present paper considers all stop-liquid and 
stop-semivowel combinations to be true clusters. This assertion goes against 
the view that stop + /y/ combinations, for example, may not be legitimate 
clusters since they only occur before the vowel /u/ (Hofmann, 1967; Chomsky and 
Halle, 1968), or that /kl, kr, kw/ combinations might be more parsimoniously 
described as single consonants (Hofmann, 1967; Menyuk and Klatt, 1968; Menyuk, 
1972). For a review of the phoneme-versus-cluster controversy, see Devine 
(1971). Fourth, in order to predict cluster frequencies and to confirm these 
estimates, one needs accurate assessments of actual cluster occurrence. Four 
extensive corpora were selected: Denes (1965), a corpus including 72,000 
phonemes spoken in British English from "phonetic readers" used to teach 
English to foreign students; Hultzén, Allen, and Miron (1964), a corpus of 
20,000 phonemes spoken in General American from selected material in eleven 
different dramatic plays; Roberts (1965), an analysis of a word list including 
15 million entries phonemically transcribed in General American; and Trnka 
(1966), an analysis of a phoneme count from 100,000 words of connected material 
transcribed in British English. Fifth, to provide a prediction of cluster 
frequency based on chance co-occurrence of successive phonemes, accurate 
assessments of the frequency of each of the ten phonemes in question are 
needed. The four sources cited above provide these estimates. They agree fairly 
well with others (Hayden, 1950; Tobias, 1959; Delattre, 196^; Card and Eckler, 
1975). See Gerber and Vertin (1969) and Wang and Crawford (1960) for 
comparisons. 

TABLE 1: Distinctive feature representation from Halle (1964; see also 
         Wickelgren, 1966) of certain phonemes in English that can form 
         initial clusters. 

[Feature matrix not legible in this copy. Columns: stop consonants (labials 
/p/, /b/; alveolars /t/, /d/; velars /k/, /g/), liquids, and semivowels. Rows: 
the distinctive features vocalic, consonantal, grave, diffuse, voiced, and 
continuant.] 

Table 2 displays the phonemic difference between member phonemes for the 24 
clusters, along with observed and predicted percentages. Cluster frequencies 
were determined separately for each corpus. Observed percentages were then 
calculated by dividing the number of occurrences of each cluster by the total 
occurrences of all 24 clusters. To compute predicted percentages, the percentage 
occurrence for each of the ten individual phonemes was first obtained; the 
product of the percentages for the two phonemes in each cluster was determined; 
this product was divided by the sum of the products for all 24 clusters, and 
multiplied by one hundred. 
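
The chance prediction described above can be sketched in a few lines. The phoneme percentages below are hypothetical placeholders, not values from any of the four corpora, and the names `PHONEME_PCT` and `predicted_percentages` are introduced only for illustration.

```python
from itertools import product

STOPS = ["p", "b", "t", "d", "k", "g"]
SECOND = ["l", "r", "w", "y"]  # liquids and semivowels

# Hypothetical percent occurrence of each of the ten phonemes in a corpus.
PHONEME_PCT = {
    "p": 2.0, "b": 2.0, "t": 7.0, "d": 4.0, "k": 3.0, "g": 1.0,
    "l": 4.0, "r": 4.0, "w": 2.0, "y": 1.0,
}


def predicted_percentages(pct):
    """Chance prediction: the product of the two member-phoneme percentages,
    divided by the sum of the products over all 24 clusters, times 100."""
    products = {s + c: pct[s] * pct[c] for s, c in product(STOPS, SECOND)}
    total = sum(products.values())
    return {cluster: 100.0 * value / total for cluster, value in products.items()}


predicted = predicted_percentages(PHONEME_PCT)

for cluster, value in predicted.items():
    print(f"/{cluster}/ {value:.1f}")
```

Because the products are renormalized over the 24 clusters, the predicted percentages sum to 100 and can be compared directly with the observed percentages, which are normalized over the same 24 clusters.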

Notice the variation among the four corpora. In particular, variation is 
greater for observed frequencies than for predicted frequencies. For example, 
observed frequencies for /pr/, /kr/, and /tl/ differ by factors of 3:1 or 
greater, whereas predicted frequencies differ by much less. 

TABLE 2: Twenty-four initial clusters, their phonemic difference, and 
         their observed and predicted frequencies from four different 
         corpora. 

        Phonemic        Denes     Hultzén et al.   Roberts        Trnka 
Cluster difference      (1965)        (1964)        (1965)        (1966) 
                       Obs.  Pred.   Obs.  Pred.   Obs.  Pred.   Obs.  Pred. 

/pr/        5          11.1    2.3   13.3    1.9   25.7    3.0    9.0    5.6 
/pl/        4          10.7    3.1    8.3    1.1    6.5    1.5    6.0    3.0 
/pw/        3            .1    2.1     .0    2.1     .0    2.1     .0    1.7 
/py/        4            .5    1.2    2.2    3.0     .7    3.1     .9     .5 
/br/        4           4.6    2.7    6.6    2.4    7.8    3.1   10.4    5.0 
/bl/        3           9.6    3.6    3.9    1.4    4.7    1.5    8.1    2.7 
/bw/        2            .0    2.5     .0    2.6     .0    2.1     .0    1.5 
/by/        3            .1    1.4     .5    3.8     .5    3.1     .4     .4 
/tr/        4          11.9   10.8   11.6   10.0   12.4   13.1   11.6   19.8 
/tl/        3          10.4   14.2    4.4    6.0     .0    6.5     .0   10.6 
/tw/        4           1.5   10.0    1.1   10.8    1.9    9.0    2.7    6.0 
/ty/        3           2.4    5.8     .0   16.0     .0   13.4    1.3    1.7 
/dr/        3           2.8    5.4    6.6    4.1    6.6    5.7    5.9   12.0 
/dl/        2           2.5    7.1    2.2    2.5     .0    2.9     .0    6.4 
/dw/        3            .0    5.0    1.7    4.5     .2    3.9     .4    3.6 
/dy/        2           4.0    2.9     .0    6.5     .0    5.9    1.0    1.1 
/kr/        4           2.4    3.7    8.8    3.5    8.8    4.6   10.8    7.5 
/kl/        5           7.1    4.9    6.1    2.1    7.3    2.3    9.2    4.1 
/kw/        4           7.6    3.5    5.0    3.8    4.5    3.2    5.0    2.3 
/ky/        5           2.4    2.0    2.2    5.6     .7    4.7     .9    4.1 
/gr/        3           6.2    1.5   12.2    1.4    9.1    1.6   10.2    1.9 
/gl/        4           1.4    2.0    1.7     .9    2.6     .8    5.6    1.1 
/gw/        3            .1    1.4     .0    1.6     .0    1.1     .4     .6 
/gy/        4            .5     .8    1.7    2.3     .0    1.7     .0     .2 

Total                  99.9   99.9  100.1   99.9  100.0   99.9   99.8   99.9 

Eight correlation coefficients were calculated, two for each corpus. First, 
predicted and observed cluster frequencies were correlated within the same 
corpus, then the phonemic difference scores were correlated with the obtained 
cluster frequencies. The results of these analyses are shown in Table 3, along 
with statistical assessments of each. In addition, the mean predicted and mean 
observed frequencies were calculated from the four corpora, and correlations 
performed. Notice that for the American corpora the observed frequencies 
correlated more highly with the phonemic difference scores than did the 
control, or chance-factor, frequency estimates. (This is all the more 
impressive since, owing to the large number of ties in phonemic difference, 
maximum correlation coefficients calculated here can be only .90.) The phonemic 
difference correlations for the Hultzén, Allen, and Miron (1964) corpus and the 
Roberts (1965) corpus were significantly greater than the control correlations 
(t = 2.51, p < .025, and t = 2.33, p < .05, respectively; McNemar, 
1969:157-158). It is interesting to note that these are the two corpora based 
on American English pronunciation, whereas the other two are based on British 
English. 

TABLE 3: Correlations between observed cluster frequencies and 
         (a) those predicted from phoneme frequencies within 
         the same corpus, and (b) number of distinctive features 
         separating phonemes within a cluster. df = N-3 = 21 
         (see McNemar, 1969). 

                                    Correlation with         Correlation with 
Corpus                              frequencies predicted    phonemic difference 
                                    by chance 

Denes (1965)                              .40                      .34 
Hultzén, Allen, and Miron (1964)         -.19                      .45* 
Roberts (1965)                            .03                      .52** 
Trnka (1966)                              .45*                     .46* 
Mean of corpora                           .20                      .47* 

 *p<.05, two-tailed 
**p<.01, two-tailed 
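
The phonemic difference correlations for the two American corpora can be checked against the entries of Table 2. The sketch below assumes that the observed percentages are read from Table 2 in the order Denes, Hultzén et al., Roberts, Trnka, with observed values preceding predicted values for each corpus; the helper `pearson_r` is an ordinary product-moment correlation written out for illustration.

```python
from math import sqrt

# Cluster order follows Table 2: /pr, pl, pw, py, br, bl, ..., gy/.
CLUSTERS = [s + c for s in "pbtdkg" for c in "rlwy"]

PHONEMIC_DIFFERENCE = [5, 4, 3, 4, 4, 3, 2, 3, 4, 3, 4, 3,
                       3, 2, 3, 2, 4, 5, 4, 5, 3, 4, 3, 4]

# Observed percentages as read from Table 2 (assumed column assignment).
HULTZEN_OBSERVED = [13.3, 8.3, 0.0, 2.2, 6.6, 3.9, 0.0, 0.5,
                    11.6, 4.4, 1.1, 0.0, 6.6, 2.2, 1.7, 0.0,
                    8.8, 6.1, 5.0, 2.2, 12.2, 1.7, 0.0, 1.7]
ROBERTS_OBSERVED = [25.7, 6.5, 0.0, 0.7, 7.8, 4.7, 0.0, 0.5,
                    12.4, 0.0, 1.9, 0.0, 6.6, 0.0, 0.2, 0.0,
                    8.8, 7.3, 4.5, 0.7, 9.1, 2.6, 0.0, 0.0]


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_hultzen = pearson_r(PHONEMIC_DIFFERENCE, HULTZEN_OBSERVED)
r_roberts = pearson_r(PHONEMIC_DIFFERENCE, ROBERTS_OBSERVED)

print(f"Hultzen et al.: r = {r_hultzen:.2f}")
print(f"Roberts:        r = {r_roberts:.2f}")
```

Under these assumptions the two coefficients come out close to the .45 and .52 listed in Table 3, which also serves as a check on the reading of the table columns.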

Is this principle general and distributed across the various phonetic 
features? Or is it, as Carroll (1958) suggests, primarily a function of one 
feature, voicing? The answer can be seen in Table 4. The percent occurrence of 
all clusters that do not share each of the six features is compared against 
chance, calculated by dividing the number of clusters involved by the total 
number of clusters (24). Phonemic difference along all but one feature, 
consonantal, fits into the general scheme, providing equal to or greater than 
chance prediction. Consider the features in more detail. Vocalic and 
consonantal features can be yoked since all clusters differ on one or other 
(but not both) of the features; the first separates /l,r/, the second /w,y/, 
from the stop consonants. The vocalic feature predicts cluster frequencies very 
nicely, but the consonantal feature does not. Notice that the observed average 
of these two features is exactly 50 percent, or chance. Following Carroll 
(1958), one can eliminate these two features, yoked together, since they do not 
provide greater-than-chance prediction. The continuant feature can also be 
dismissed since members of all 24 clusters differ along this dimension. Only 
three features remain: grave, diffuse, and voiced. Each of these three features 
appears to contribute nearly equally to the phonemic difference effect, 
providing from 11 to 18 percent better-than-chance prediction. Moreover, 
voicing is not the most potent feature, as Carroll suggested it might be. 

TABLE 4: Analysis of the effect of phonemic difference for 
         each of the six distinctive features in predicting 
         cluster frequency. Data base is mean of four corpora. 

               Observed percent of 
Distinctive    clusters with this 
feature        feature not shared     Chance    Clusters involved 

Vocalic              86.3              50.0     stop + /l,r/ 
Consonantal          13.7              50.0     stop + /w,y/ 
Grave                69.9              58.3     /pl, pr, py, bl, br, by, tw, dw, 
                                                kr, kl, ky, gr, gl, gy/ 
Diffuse              59.2              41.7     /pr, br, tr, dr, kl, kw, ky, gl, 
                                                gw, gy/ 
Voiced               66.1              50.0     /p,t,k/ + liquid, 
                                                /p,t,k/ + semivowel 
Continuant          100.0             100.0     all 
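
The "Clusters involved" column of Table 4 is enough to recover both the chance values and the phonemic difference of every cluster, since a cluster's phonemic difference is simply the number of features on which its two members differ. The sketch below makes that bookkeeping explicit; the variable names are illustrative, and the comparison values are the phonemic differences listed in Table 2.

```python
CLUSTERS = [s + c for s in "pbtdkg" for c in "rlwy"]

# For each feature, the clusters whose two members differ on it (Table 4).
DIFFERS_ON = {
    "vocalic": {c for c in CLUSTERS if c[1] in "lr"},
    "consonantal": {c for c in CLUSTERS if c[1] in "wy"},
    "grave": set("pl pr py bl br by tw dw kr kl ky gr gl gy".split()),
    "diffuse": set("pr br tr dr kl kw ky gl gw gy".split()),
    "voiced": {c for c in CLUSTERS if c[0] in "ptk"},
    "continuant": set(CLUSTERS),
}

# Phonemic difference: number of features not shared by the two members.
phonemic_difference = {
    c: sum(c in members for members in DIFFERS_ON.values()) for c in CLUSTERS
}

# Chance level: clusters involved divided by the 24 clusters, as a percent.
chance = {
    feature: round(100.0 * len(members) / len(CLUSTERS), 1)
    for feature, members in DIFFERS_ON.items()
}

# Phonemic differences as listed in Table 2, for comparison.
TABLE_2_DIFFERENCE = dict(zip(
    CLUSTERS,
    [5, 4, 3, 4, 4, 3, 2, 3, 4, 3, 4, 3, 3, 2, 3, 2, 4, 5, 4, 5, 3, 4, 3, 4],
))

matches_table_2 = phonemic_difference == TABLE_2_DIFFERENCE
print(chance, matches_table_2)
```

Counting feature memberships in this way reproduces the phonemic difference column of Table 2 for all 24 clusters, so the two tables are mutually consistent.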



If the phonemic-difference principle is tenable as a predictor of cluster 
frequency, one might expect that any phonological change within a cluster should 
be in the direction of increasing phonemic difference, or perhaps in the 
elimination of the cluster altogether. In American English an increase can be 
seen in the affrication of alveolar + /r/ clusters; /tr/ and /dr/ clusters tend 
to go to /tʃr/ and /dʒr/, as in TRY and DRY. Affrication, in effect, adds the 
additional contrast of strident-nonstrident to members of both clusters. 
According to the mean of the four corpora, these two clusters are the second and 
tenth most frequent of the 24, and the addition of the strident contrast 
increases their phonemic difference to 5 and 4, respectively. In American 
English the stops /t/ and /d/ are involved in cluster elimination as well. Often 
these are simplified to a single consonant: /ty/ and /dy/ go to /t/ and /d/ in 
TUBE and DUTY. Notice that the Hultzén et al. (1964) corpus and the Roberts 
(1965) corpus lack any /ty/ and /dy/ clusters, whereas the Denes (1965) and 
Trnka (1966) corpora contain a number of them, which illustrates one difference 
between American and British English. 

In conclusion, a revised version of the phonemic-difference principle 
postulated by Saporta (1955) serves to predict cluster frequency. Those 
clusters investigated here are stop-liquid and stop-semivowel combinations, 
which Saporta did not consider. As Carroll (1958) suggested, this principle 
serves to predict cluster frequency better than chance, particularly for the 
American English corpora. The effect is distributed across the different 
distinctive features and is particularly strong for grave, diffuse, and voicing 
features. 


Hemispheric Specialization for Speech Perception in Four-Year-Old Children from 
Low and Middle Socioeconomic Classes* 

Donna S. Geffner and M. F. Dorman 



ABSTRACT 

Four-year-old male and female children from low and middle 
socioeconomic classes (SEC) were tested on a dichotic syllable task. 
Both low and middle SEC males evidenced significant right-ear 
advantages. Neither low nor middle SEC females evidenced a significant 
right-ear advantage. The similar ear advantage in the low and middle 
SEC populations replicates a previous study with six-year-olds and 
suggests that variations in rearing conditions between low and middle 
socioeconomic classes do not affect hemispheric lateralization for 
speech perception. The absence of a right-ear advantage for females 
replicates the outcome of several other investigators and points to 
the need for longitudinal rather than cross-sectional studies of the 
development of cerebral lateralization. 

INTRODUCTION

Several studies have indicated that children from low socioeconomic class 
(SEC) backgrounds may develop cerebral lateralization for speech perception at 
a slower rate than their middle SEC cohorts. Kimura (1967), using a dichotic
digits task to assess cerebral lateralization, reported that at age five both
low and high SEC females and high SEC males evidenced a right-ear advantage
(indicating left-hemisphere specialization for speech perception). Low SEC males,
however, did not evidence a right-ear advantage until age six. Geffner and
Hochberg (1971), also using a dichotic digits task, reported a significant
right-ear advantage for middle SEC children aged four through seven, but found
for low SEC children a significant right-ear advantage only at age seven.
Taken together, these data suggest that some, as yet unspecified, environmental
rearing conditions may retard the onset of cerebral lateralization of function. 

Recently Dorman and Geffner (1974) have provided another interpretation of
the studies reported above: the reduced right-ear advantage of the low SEC
children on the dichotic digits tasks may be due to their overall poor
performance or, in other words, to a "floor effect" reflecting task difficulty
and lack of motivation. To assess this hypothesis, six-year-old black and white
children from low and middle SEC backgrounds were tested with a simplified
dichotic listening task (one monosyllable pair on each trial) by an experimenter
of the subjects' own race. All groups evidenced a significant right-ear advantage.
Furthermore, the magnitude of the right-ear advantage did not differ as a
function of race or SEC. This outcome suggests that low SEC children, at least by
age six, do not lag behind middle SEC children in development of hemispheric
lateralization for speech perception.

It is, of course, possible that rearing conditions may exert an influence
on cerebral lateralization of function, but that such effects are detectable
only in children younger than age six. To assess this, in the present study,
four-year-old children from low and middle SEC backgrounds were tested on a
dichotic syllable task.

METHOD 

Subjects

The subjects were 44 four-year-old children: 21 low SEC (9 males and 12 
females) and 23 middle SEC (11 males and 12 females). Socioeconomic class was 
determined by Hollingshead's Two Factor Index of Social Position (Hollingshead,
1957), which takes into account the parents' educational level and occupational
status. All subjects were right-handed (handedness tasks are detailed in the
Procedure) and had normal hearing with no known perceptual, neurological, speech,
or language deficit. Children with bilingual background were not selected. 

Apparatus 

The speech signals were recorded and reproduced on a Panasonic RS2^9US 
stereo tape deck. The signals were presented to the children via matched and
calibrated TDH-39 headphones. The output of each tape channel was monitored by
a 1000-Hz calibration signal on each channel. Audiometric threshold tests were
administered on a Maico MA-10 portable audiometer calibrated to International
Standards Organization (ISO) measures.

Preparation of Stimuli

Six stop-consonant-vowel syllables /ba, da, ga, pa, ta, ka/ were generated
on the Haskins Laboratories parallel-resonance speech synthesizer. Each stimulus
was 300 msec in duration. Under computer control the six syllables were
combined into their 15 possible pairs and recorded dichotically, in a fully
balanced order, on magnetic tape. The resulting tape contained 60 syllable pairs
with each member of a pair occurring twice on each channel. The interstimulus
interval was four seconds.
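
The construction of the 60-item tape can be sketched as follows. This is only an illustration under stated assumptions: the function name and the seeded shuffle are hypothetical, and the shuffle merely stands in for the "fully balanced order," whose exact form the text does not specify.

```python
from itertools import combinations
import random

SYLLABLES = ["ba", "da", "ga", "pa", "ta", "ka"]

def build_dichotic_trials(syllables=SYLLABLES, seed=0):
    """Return 60 (left, right) trials built from the 15 unordered pairs.

    Each pair is presented four times, so that each member of the pair
    occurs twice on the left channel and twice on the right channel.
    """
    trials = []
    for a, b in combinations(syllables, 2):  # 15 unordered pairs
        trials += [(a, b), (a, b), (b, a), (b, a)]
    # Assumption: a seeded shuffle stands in for the balanced ordering.
    random.Random(seed).shuffle(trials)
    return trials

trials = build_dichotic_trials()
```

With six syllables, each syllable belongs to five pairs and therefore occurs ten times on each channel over the 60 trials.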

Procedure 

Each subject was tested individually in a quiet room. All subjects were 
first given an audiometric threshold test. Hearing level at 500, 1000, 2000, 
and 4000 Hz was assessed. If the hearing level between the two ears differed by
10 dB or more for two of the test frequencies, the subject was excluded from
further testing. Handedness was determined by asking the subjects to perform
three manual motor tasks: throwing a ball, cutting with scissors, and drawing a 
circle. Any subject who did not perform all three tasks with his right hand was 
not tested further. 

After the preliminary examination, the subjects were presented binaurally 
with three repetitions of the syllables /ba, da, ga, pa, ta, ka/. The subjects
were instructed to listen with both ears and report the syllable heard. Any 
subject unable to repeat the six syllables after the third repetition of the 
list was excluded from further testing. The subjects were then instructed to 
listen again with both ears and report the syllable they heard. (Since the sub- 
jects were not told that there were two different stimuli on these and the fol- 
lowing dichotic trials, only one response was elicited.) The subjects were told
that these sounds would sound "funny," but to continue reporting them as before. 
The subjects were then presented the 60-item test sequence, followed by a brief 
rest, then the 60-item test again. To control for possible channel effects, the
headphones were reversed after the first 60-item test. 

RESULTS 

A subject's results were excluded from the data analyses if he/she did not 
complete the 120 trials or if he/she gave perseverative responses (the same
syllable on most trials). On these criteria, the test results of 48 percent of the
low SEC subjects were excluded from the data analyses. In contrast, none of the 
data from the middle SEC subjects was excluded. 

Each subject's performance was scored in terms of the metric (R - L)/(R + L) x
100, where R is the total number of syllables correctly recalled from the right
ear and L is the total number of syllables correctly recalled from the left ear.
The mean scores for the two socioeconomic class groups, as a function of sex, 
are shown in Table 1. 
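
As a small worked illustration of the scoring metric, the sketch below computes the index for one subject. The function name is hypothetical and the counts in the example are invented for illustration; they are not data from the study.

```python
def ear_advantage(right_correct, left_correct):
    """Laterality index (R - L)/(R + L) x 100.

    Positive values indicate a right-ear advantage, negative values a
    left-ear advantage.
    """
    total = right_correct + left_correct
    if total == 0:
        raise ValueError("no correct responses from either ear")
    return 100.0 * (right_correct - left_correct) / total

# Illustrative counts only: 45 correct from the right ear, 35 from the left.
example_score = ear_advantage(45, 35)  # 12.5
```

Because the index is computed per subject and then averaged, a group mean of these scores need not equal the index computed from the group's mean R and L.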



TABLE 1: Average magnitude of the right-ear advantage in terms of (R - L)/(R + L) x
100 for male and female subjects from low and middle SEC groups.

             Low SEC     Middle SEC

Male          12.53        12.00
Female         0.60        -0.83






The magnitude of the right-ear advantage did not differ significantly be- 
tween the low and middle SEC groups (z = 1.02, p > .05). However, an overall sex
effect was observed (z = 2.07, p < .02), with males evidencing a significantly
larger right-ear advantage than females. 

The mean number of syllables correctly reported from each ear for the male
and female subjects in the low and middle SEC groups is shown in Table 2. Both
low and middle SEC male subjects evidenced a significant right-ear advantage.
Neither the low nor middle SEC female subjects evidenced a significant right-ear
advantage. Sixty-six percent of the female subjects, collapsed over SEC,
evidenced a right-ear advantage (mean = 11.22) and 33 percent a left-ear advantage
(mean = 23.51). Of the male subjects, 65 percent evidenced a right-ear
advantage (mean = 22.04) and 35 percent a left-ear advantage (mean = 6.41). Thus, the
male subjects evidenced both larger average right-ear advantages and smaller
left-ear advantages than the female subjects.



TABLE 2: Mean number of syllables correctly reported from each ear.

Group                   N      Left     Right       t

Male, Middle SEC       11     32.82     41.73     2.05*
Female, Middle SEC     12     41.08     38.75    -0.32
Male, Low SEC           9     29.44     39.33     1.94*
Female, Low SEC        12     35.58     36.41     0.20

*p < .05, one-tailed



DISCUSSION

The presence of a similar right-ear advantage in both low and middle SEC
four-year-old children replicates the outcome of an earlier study with
six-year-olds (Dorman and Geffner, 1974). Thus, variation in rearing conditions, at
least for the range subsumed under the categories low and middle SEC, does not
appear to affect the rate of cerebral lateralization for speech perception (cf.
Ingram, 1975). This conclusion must, however, be tempered by the fact that a
large number of low SEC children could not be tested with the dichotic syllable
task.

The absence of a significant right-ear advantage in females was unexpected.
However, several other investigators, in spite of their different dichotic
listening tasks, have reported a similar outcome with four-year-old females.
Ingram (1975), Nagafuchi (1970), and Yeni-Komshian have all found significant
right-ear advantages for four-year-old males, but not for females. While one
such effect may reasonably be attributed to sampling error, the similar outcome
of four independent studies strongly suggests that females, aged four years, do
indeed perform differently than male coevals on dichotic listening tasks.

One possible reason for this male-female difference, namely, that more
females than males give left-ear advantages, is ruled out by the present study,
since the proportion of subjects displaying a left-ear advantage was roughly the
same for both sexes. Rather, an average right-ear advantage for females was
eliminated because, while a majority of the children evidenced moderate right-ear
advantages, the remainder displayed large left-ear advantages. Thus, the
absence of an average right-ear advantage for the females does not imply that
individual females are not lateralized. On the contrary, the absolute ear
advantage (ignoring direction) for the males and females was essentially identical
(females = 15.54; males = 16.67).
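
The cancellation described above can be checked approximately from the subgroup means reported in the Results. The sketch below is a rough consistency check, not a reanalysis: the subgroup sizes (16 and 8 of 24 females, 13 and 7 of 20 males) are assumptions inferred from the reported percentages, and the function name is hypothetical.

```python
def mean_advantages(n_right, mean_right, n_left, mean_left):
    """Combine subgroup means into signed and absolute group means.

    mean_left is the magnitude of the left-ear advantage, so it enters
    the signed mean with a negative sign.
    """
    n = n_right + n_left
    signed = (n_right * mean_right - n_left * mean_left) / n
    absolute = (n_right * mean_right + n_left * mean_left) / n
    return signed, absolute

# Assumed subgroup sizes, inferred from 66/33 percent and 65/35 percent.
female_signed, female_abs = mean_advantages(16, 11.22, 8, 23.51)
male_signed, male_abs = mean_advantages(13, 22.04, 7, 6.41)
```

Under these assumptions the signed female mean falls close to zero while the absolute means for the two sexes come out near 15 and 17, in rough agreement with the values given in the text; small discrepancies are to be expected from rounding of the percentages.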

Ingram (1975) has pointed out that the absence of an overall right-ear
advantage in four-year-old females is all the more puzzling, given that three- and
five-year-old females do display a right-ear advantage. Ingram noted one
possible interpretation of these results: that there may be a period of cerebral
reorganization, during which left-hemisphere functions are temporarily preempted
by functions other than speech. Other interpretations are, of course, possible.
For example, changes in magnitude of the ear advantage may reflect changing
linguistic processing strategies, similar to those demonstrated by Bever (1970)
in sentence processing tasks. In any event, cross-sectional developmental
studies, with their inherent problems of sampling error, are clearly inadequate
to the task: a longitudinal cohort study (cf. Schaie and Strother, 1968) seems
to be needed to resolve these issues.

Automatic Segmentation of Speech into Syllabic Units*
Paul Mermelstein

ABSTRACT 

As a first step toward automatic phonetic analysis of speech,
one desires to segment the signal into syllable-sized units.
Experiments were conducted in automatic segmentation techniques for
continuous, reading-rate speech to derive such units. A new segmentation
algorithm is described that allows assessment of the significance of
a loudness minimum to be a potential syllabic boundary from the
difference between the convex hull of the loudness function and the
loudness function itself. Tested on roughly 400 syllables of
continuous text, the algorithm results in 6.9 percent syllables missed and
2.6 percent extra syllables relative to a nominal, slow-speech
syllable count. It is suggested that inclusion of alternative fluent-form
syllabifications for multisyllabic words and the use of phonological
rules for predicting syllabic contractions can further improve
agreement between predicted and experimental syllable counts.

INTRODUCTION 

Automatic phonetic analysis of speech, such as that carried out as part of
a continuous speech understanding system, requires a mapping from acoustic
signal to phonetic segments whose direct implementation has eluded speech
researchers for many years. Liberman (1970) reviews the case for considering the
conversion between phone and sound to be a process of complex grammatical recoding
that may prevent one from ever finding a direct replacement of sound segments by
phones. In agreement with that point of view, we consider an alternative,
indirect approach that segments the speech stream into syllable-sized units and
decodes the phonetic segments of those units by considering the acoustic
information contained in the entire syllable (Mermelstein, 1975). This paper
presents results of experiments in automatic segmentation of continuous speech into
such syllable-sized units.

The syllable has been defined linguistically as "a sequence of speech
sounds having a maximum or peak of inherent sonority (that is, apart from
factors such as stress and voice pitch) between two minima of sonority" (Robins,
1966). To arrive at an operational definition that can be implemented
computationally, one must define sonority in terms of physical measures on the speech
signal. This requirement leads quickly to the realization that "inherent
sonority" cannot be empirically defined because the same parameter, intensity,
signals (in part) both sonority and stress, and the division between the two
factors is rather arbitrary. The argument that stress values are assigned to
entire syllables and sonority varies from phone to phone within the syllable
cannot be applied to separate the two factors since it is precisely the operational
determination of syllables that we are trying to achieve.

Stowe (1963) attacked this problem by a hierarchic series of segmentation
procedures, each operating on a different time function computed from the speech
signal. Sargent, Li, and Fu (1974) also used two functions for syllable
detection, one measuring peak-to-peak amplitude, the other root-mean-square (RMS)
intensity. In this work we explore a new approach. We attack the resolution of
the above problem by defining a "loudness" measure for the speech signal, a
time-smoothed and frequency-weighted summation of its energy content. Relative
loudness maxima are interpreted as potential syllabic peaks and relative
loudness minima as potential syllabic boundaries. To differentiate between
syllables generally defined on the phonological level and the speech segments that may
be located in the signal by phonetic criteria, we introduce the term "syllabic
unit" for the syllable-sized speech segments that are to be found automatically.
Boundaries located by loudness criteria do not necessarily segment the speech
signal at points that can be identified as phone boundaries, or even word
boundaries. The syllabic units are found to depend strongly on the phonetic
performance of the speaker; in fact, they serve to describe that performance by
grouping segments into larger units that generally form units of production as well.

In order to arrive at a segmentation of the signal into syllable-sized 
units, we find that one must define a measure of significance that permits 
classifying loudness minima as to whether they denote actual boundaries.
Otherwise, the number of realized segments greatly exceeds the number of syllables
one would count perceptually. Further, the measure of significance must be a
function of the context of any particular loudness minimum. A local loudness
minimum separated by less than 100 msec from another local minimum with lesser
loudness may be insignificant, yet the same minimum with no other minima within 
500 msec would generally signal a syllabic boundary. 

The significance of loudness maxima must be similarly evaluated. In order
to prevent segmentation into fragments that do not contain adequately strong
syllabic peaks, we reject any segment whose loudness maximum is more than a
given threshold below the overall loudness maximum, the syllabic peak of the
loudest syllable of the utterance. Similarly, a minimum syllabic-unit duration
of 80 msec is imposed, and segmentation that would result in shorter fragments
is rejected.
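
The two acceptance criteria can be sketched as a simple test on a candidate unit. This is an illustration under stated assumptions: loudness is taken to be sampled in dB at a fixed frame interval (the 10-msec frame is an assumption, not a value from the text), the 25-dB default is the peak threshold quoted later under Experimental Results, and the function and parameter names are hypothetical.

```python
def keep_unit(loudness_db, start, end, utterance_peak_db,
              frame_ms=10.0, peak_threshold_db=25.0, min_duration_ms=80.0):
    """Return True if loudness_db[start:end] is acceptable as a syllabic unit.

    A candidate is rejected if it is shorter than the minimum duration or if
    its loudness peak lies more than peak_threshold_db below the loudest peak
    of the utterance.
    """
    duration_ms = (end - start) * frame_ms  # assumes a fixed frame interval
    if duration_ms < min_duration_ms:
        return False
    unit_peak_db = max(loudness_db[start:end])
    return utterance_peak_db - unit_peak_db <= peak_threshold_db

# Illustrative loudness track (dB), one value per 10-msec frame.
track = [10, 40, 10, 5, 8, 5, 5, 5, 5, 5, 5]
```

In this illustrative track the first eight frames form an acceptable unit, whereas a three-frame candidate fails on duration and the last eight frames fail on peak level.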

One important application of syllabic-unit segmentation is as an aid to
lexical analysis where one would like the same text spoken by different
speakers to show at most a small number of alternative syllabic-unit representations.
Fricatives are generally not tightly bound to the syllabic units with which they
are associated but are frequently separated from them by a short interval of
weak voicing or even silence. On the basis of loudness criteria alone, they
form valid syllabic units. For the purposes of evaluating the results of our
segmentation procedures and for accessing a lexicon of syllabic forms, we
require that syllabic units have nonfricative nuclei. If subsequent analysis
reveals that a syllabic unit manifests significant frication near the syllabic
peak, it is labeled as a syllabic fragment and not counted as an independent
syllabic unit.

SEGMENTATION USING A CONVEX-HULL ALGORITHM

In order that our empirically determined loudness function roughly approxi- 
mate the subjective loudness function, loudness is obtained from the speech 
power spectrum by weighting frequencies below 500 Hz and above 4 kHz according 
to a function that drops off at 12 dB/octave outside these frequencies. To 
eliminate variations in loudness due to the phase of the fundamental frequency
of excitation, the loudness function is low-pass filtered at 40 Hz. Our
implementation computes loudness from the short-time power spectrum, but it could be
equally well derived by directly filtering the speech wave.
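
The frequency weighting can be sketched as follows. This is one plausible reading of the description, not the original implementation: the weighting is taken to be flat between 500 Hz and 4 kHz and to fall at 12 dB/octave outside that band, the 40-Hz smoothing over time is omitted, and all names are hypothetical.

```python
import math

LOW_CORNER_HZ = 500.0
HIGH_CORNER_HZ = 4000.0
SLOPE_DB_PER_OCTAVE = 12.0

def weight_db(freq_hz):
    """Weighting in dB: 0 dB in the passband, -12 dB/octave outside it."""
    if freq_hz <= 0:
        return float("-inf")
    if freq_hz < LOW_CORNER_HZ:
        return -SLOPE_DB_PER_OCTAVE * math.log2(LOW_CORNER_HZ / freq_hz)
    if freq_hz > HIGH_CORNER_HZ:
        return -SLOPE_DB_PER_OCTAVE * math.log2(freq_hz / HIGH_CORNER_HZ)
    return 0.0

def frame_loudness_db(freqs_hz, powers):
    """Weighted sum of one frame's power spectrum, expressed in dB."""
    total = sum(p * 10.0 ** (weight_db(f) / 10.0)
                for f, p in zip(freqs_hz, powers) if f > 0)
    return 10.0 * math.log10(total) if total > 0 else float("-inf")
```

A component one octave below 500 Hz or one octave above 4 kHz is thus attenuated by 12 dB before the summation; applying `frame_loudness_db` frame by frame and smoothing the result would give a loudness function of the kind the segmentation operates on.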

Initially, a segment of speech between apparent pauses (silent interval
exceeds 200 msec) is selected for analysis. The "convex hull" of the loudness
function is defined as the minimal-magnitude function that is monotonically
nondecreasing from the start of the segment to its point of maximum loudness, and
is monotonically nonincreasing thereafter. Within the segment, the difference
between the convex hull and the loudness function serves as a measure of
significance of loudness minima. The point of maximal difference is a potential
boundary. If the difference there exceeds a given threshold, the segment is
divided into two subsegments.

Segmentation is carried out recursively. The convex hulls newly computed
for the subsegments nowhere exceed the convex hull of the original segment.
Hence, after any segmentation step, only less significant minima remain. If the
maximal hull-loudness difference within the segment is below the threshold, no
further segmentation of that segment is attempted. The algorithm makes use of
the loudness context implicitly by extracting minima in order of significance.
Thus, a minimum may not be significant if there is a more significant one close
by. Segmentation removes the more significant minimum and allows
reconsideration of the significance of the other minimum.
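
The recursive procedure just described can be sketched in a few lines. This is a minimal illustration rather than the original program: loudness is assumed to be a list of dB values at a fixed frame rate, the boundary frame is assigned to the right-hand subsegment by convention, the peak and duration checks are left out, and the function names are hypothetical.

```python
def convex_hull(loudness):
    """Minimal function >= loudness that rises to the peak and falls after it."""
    n = len(loudness)
    peak = max(range(n), key=loudness.__getitem__)
    hull = list(loudness)
    for i in range(1, peak + 1):           # running maximum from the left
        hull[i] = max(hull[i - 1], loudness[i])
    for i in range(n - 2, peak - 1, -1):   # running maximum from the right
        hull[i] = max(hull[i + 1], loudness[i])
    return hull

def segment(loudness, threshold_db=2.0, start=0, end=None):
    """Return boundary indices found in loudness[start:end], in time order."""
    if end is None:
        end = len(loudness)
    if end - start < 3:
        return []
    piece = loudness[start:end]
    hull = convex_hull(piece)
    diffs = [h - l for h, l in zip(hull, piece)]
    cut = max(range(len(diffs)), key=diffs.__getitem__)
    if diffs[cut] <= threshold_db:         # no sufficiently significant minimum
        return []
    cut += start
    return (segment(loudness, threshold_db, start, cut)
            + [cut]
            + segment(loudness, threshold_db, cut, end))

# Two loudness humps separated by a deep minimum at index 4.
boundaries = segment([0, 10, 20, 10, 5, 12, 22, 12, 0])  # [4]
```

In the example the hull bridges the dip between the two humps, the hull-loudness difference of 15 dB at index 4 exceeds the 2-dB threshold, and neither subsegment contains a further minimum, so a single boundary is returned. A dip of only 1 dB, as in `[0, 10, 20, 19, 20, 10, 0]`, yields no boundary at the same threshold.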

Figure 1 illustrates how the implementation of the convex-hull algorithm is
applied. An original speech segment over the interval (a-c) is found to possess
a loudness function l(t) with a maximum at point b. The convex hull computed
for the segment (a-b-c) is h1(t). Over the interval (a-c), the maximum
hull-loudness difference, d1, is at c'. If d1 exceeds the threshold, segment (a-b-c)
is cut up into segment (a-c') followed by segment (c'-b-c). The hull for
segment (a-c'), defined around the new maximum point b', follows the loudness curve.
This results in a zero hull-loudness difference over that interval and that
portion is not segmented further. The hull for segment (c'-b-c), denoted by h2(t),
is shown by the short dashed line where it differs from h1(t) over the segment
interval. The new maximum hull-loudness difference is found to be d2. If d2
does not exceed the threshold, then the segment (c'-c) is not divided further.

The algorithm does not proceed from left to right in time. It assumes that
the entire utterance is stored before processing commences, but requires only
that a complete segment delimited by silent intervals be captured before
segmentation starts. Where real-time operation is essential, the algorithm can be
modified to operate from left to right with possible backtracking over an
interval of no larger than the maximum syllabic unit interval, roughly 500 msec.

Experimental Results 

The performance of the algorithm was evaluated by processing 11 sentences
read by each of two male subjects at their comfortable reading rate. The first
six sentences (text A) make up the well-known "Rainbow Passage," and contain
both monosyllabic and multisyllabic words. The last five (text B) consisted of
only monosyllabic words and were taken from material composed by Lea (1974).
The differentiation in text material was utilized to explore the dependence of 
segmentation errors on the frequency of multisyllabic words in the text. 

Figure 2 illustrates typical results for the text "...a boiling pot of gold
at...." The segmented loudness function is plotted above a computer-generated
spectrographic representation of the utterance. The spectral data have been
preemphasized at 6 dB/octave above 300 Hz. Use of a uniformly weighted
intensity for the loudness function would miss the high-frequency energy
discontinuity for [boj - liŋ]. By using loudness as defined, high-frequency energy
variations are emphasized and the boundary is located.

By varying the segmentation threshold parameter d, we can control the
relative frequency of extra syllabic units found and the frequency of syllables
missed. A threshold d = 0 will result in too many extra syllabic units due to
segmentation even at points of minimal variation in the speech loudness. A high
threshold, d > 3 dB, will result in many significant segmentation points within
voiced segments being missed. The segmentation results at d values of 2 dB, as
compared to 1 dB, showed that 12 extra syllables in the corpus of 418 syllables
had been eliminated, and only two new missed syllables had been introduced.
Further small increases in the value of d did not result in any appreciable
difference in performance; therefore, all further results are given for the d = 2 dB
condition.

Differences between the output of the segmentation algorithm and a nominal
syllable count are given in Table 1, classified by category and speaker. Since
the syllable count is dependent on fricative detection, errors resulting from
incorrect fricative detection are indicated separately, denoted by categories E2
and M2, respectively. The major source of extra syllabic units was in prepausal
position (category E1) where significant release gestures were associated with
final stops and liquids. The syllabic-unit loudness peaks for these cases were
well above the -25-dB syllabic peak threshold, a value arrived at by empirical
adjustments to eliminate most syllabic fragments. The frequency of prepausal
extra syllabic units was highly speaker dependent, 1.2 percent for subject LL,
only 0.5 percent for GK.

Syllabic units were missed primarily because of the tendency of an
unstressed and stressed syllable pair to contract into one stressed syllabic unit
(category M1). Most such junctures had a loudness minimum that was not less
than 1 dB below the last convex hull computed; some in fact had no loudness
minima associated with them at all. In the monosyllabic text B, such errors

TABLE 1: Differences between algorithm-derived syllable
count and manual slow-speech count.

                        Difference category

Speaker       E1         E2         M1         M2       Total Syll.
Text         A   B      A   B      A   B      A   B       A    B

LL           1   1      1   -     12   2      2   2     123   86
GK           1   4      2   1      5   3      2   1     123   86



were encountered mostly in open syllables, e.g., /so aj/, /hi had/, but their
frequency was rather low (0.6 percent). Possible contractions across words that
may result in a syllable count for a sequence of words that is smaller than the
sum of the individual syllable counts may best be handled on the phonological
level through a set of rules predicting such phenomena. In the multisyllabic
text the frequency of syllables missed was significantly higher (1.4 percent).
Many of these (10 of 22) were encountered in both speakers' productions; that is,
the syllable count for the same word or words for both speakers was consistent
but different from a nominal syllable count that one would expect in slow speech.
Typical examples are [hraj-zon] and [ap-pem-li] for "horizon" and "apparently."
These forms must be considered to constitute acceptable productions alternative
to those that would contain an additional syllabic unit for each word. Our
results suggest a frequency of occurrence of problematic words whose
syllabification cannot be adequately treated on the phonetic level. For speech recognition
applications, it seems advisable to handle these multisyllabic word problems on
the lexical level by including alternative admissible syllabifications in any
lexicon of syllabic forms.

Differences categorized E2 and M2 denote extra and missed syllabic units
due to incorrect categorization of the unit as nonfricative and fricative,
respectively. Missed units result if a short vowellike interval is missed and
the unit is interpreted as completely fricative, e.g., [t^], due to a previously
discussed decision not to count fricativelike syllabic units as independent.
Extra units result if a voiced fricative shows voicing sufficiently strong that 
it is interpreted as vowellike. Presumably these errors could be eliminated 
through improvements in the fricative detection algorithm.

In summary, the overall frequency of syllable-count differences with
respect to nominal, slow-speech syllable count was 9.5 percent, consisting of
6.9 percent missed and 2.6 percent extra. We have previously reported 2.7 percent
boundaries missed and 9 percent extra syllabic units found by essentially
the same algorithm for a different text of 430 syllables (Mermelstein and Kuhn, 
1974). There the algorithm was not optimized for the value of threshold d and 
errors were counted relative to a perceptual syllable count on the same spoken 
material. The difference in missed boundaries arises from the difference in the 
standard of comparison. For lexical applications a maximal syllabic form 
appears as the most useful standard. 
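
The summary percentages can be reproduced from the counts in Table 1. The sketch below simply tallies those counts; the dictionary layout and function name are hypothetical, and the "-" entry for speaker LL (category E2, text B) is assumed to mean zero.

```python
# Counts from Table 1 as (text A, text B) for each speaker and category.
COUNTS = {
    "LL": {"E1": (1, 1), "E2": (1, 0), "M1": (12, 2), "M2": (2, 2),
           "total": (123, 86)},   # E2 text B: "-" in the table, assumed 0
    "GK": {"E1": (1, 4), "E2": (2, 1), "M1": (5, 3), "M2": (2, 1),
           "total": (123, 86)},
}

def rate(categories):
    """Percentage of all syllables falling in the given difference categories."""
    errors = sum(sum(COUNTS[s][c]) for s in COUNTS for c in categories)
    total = sum(sum(COUNTS[s]["total"]) for s in COUNTS)
    return 100.0 * errors / total

extra_pct = rate(["E1", "E2"])    # 11 of 418 syllables
missed_pct = rate(["M1", "M2"])   # 29 of 418 syllables
print(round(extra_pct, 1), round(missed_pct, 1))  # 2.6 6.9
```

The tally gives 11 extra and 29 missed units in 418 syllables, which rounds to the 2.6 and 6.9 percent quoted in the summary.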

The two sets of results, those reported here and those previously reported
(Mermelstein and Kuhn, 1974), carried out on data from different speakers and
collected under different recording conditions, yield roughly comparable
difference rates. The previous study used data on a total of 31 sentences recorded by
some five speakers, three male and two female. No large differences in overall
syllable-count difference rates are observed as long as the speech is spoken 
carefully and at a moderate rate. 

LOCATION OF SYLLABIC-UNIT BOUNDARIES 

The boundaries located by the algorithm do not bear a simple relationship
to general syllabification rules followed in "phonological" syllabic boundary
assignment where the main criterion appears to be whether words occur in the
language with that particular initial or final cluster. Based on these criteria,
the syllabification of words containing intervocalic nasals and liquids is
generally ambiguous; the sonorant may be assigned to either syllable-initial or
syllable-final position. Linguists generally assign the maximum initial
consonant sequence to the stressed syllable (Hoard, 1971). The algorithm locates a
boundary within the consonant roughly at the point of minimal first-formant
frequency. The major part of the consonantal segment is generally found assigned
to the syllable carrying heavier stress due to its greater loudness. Where
allophonic variations are associated with syllabic position, e.g., [y'aund] vs.
[a'raund], the syllabification resulting from use of the algorithm is generally
consistent with our phonetic expectations.

The change in loudness or intensity at the onset of a syllable is generally 
more abrupt than at its end; thus there is less uncertainty about the onset-time 
of a syllable than about its termination. Therefore, silent segments or those 
whose loudness is below the noise threshold are arbitrarily assigned to syllable- 
final position. This results in inclusion of nonreleased final stops in the 
previous syllable, but released stops straddle the syllabic boundary. 

Intervocalic clusters are generally divided up. Compounds such as /sAnlajt/ 
are segmented in accordance with morphemic criteria as loudness is found to de- 
crease over the nasal and to increase over the liquid. Initial or final clus- 
ters may however be frequently broken up by the syllabification when unstressed 
syllables precede or follow them. For example, /tu/+/grit/ may map to
[tug - rit], or /pajlz/+/of/ to [pajl - zaf], where the symbol - indicates the
position of the boundary within the phonetic segment stream. Generally the
effect is to couple an initial or final cluster constituent with the preceding
or following syllable if that ends or starts with a vowel. These effects occur 
sufficiently consistently, at least in our limited data, so that syllable reor- 
ganization may be predictable by rules. 

The algorithm provides a useful tool for phonetic analysis. The word pair
/rezd/ vs. /redz/ forms an interesting example where attempts to use
phonological criteria such as the measure of "vowel affinity" proposed by Fujimura (1975)
to constrain the admissible syllable structures in English break down. Here /z/
and /d/ are phonemes that may occur in either order in syllable-final position,
an exception to a general ordering of phonemes by increasing "vowel affinity" in
syllable-initial and decreasing "vowel affinity" in syllable-final position.
The convex-hull algorithm invariably classifies /rezd/ as one syllabic unit and
/redz/ as two, a proper unit followed by a syllabic fragment. The "vowel
affinity" of the fricative is different in the two cases, as manifested by a
difference in intensity of voicing. The fricative in /redz/ is but weakly
voiced, frequently devoiced. The postvocalic /z/ preceding a voiced stop carries
stronger voicing. When followed by an unstressed vowel in syllable-initial
position, this difference manifests itself by an assignment of the fricative to
the first syllabic unit in the case where /rezd/+/in/ gives [rezd - din]
(syllabic-unit boundary within the closure of the /d/), but to the second syllabic
unit in /redz/+/in/ mapping to [red - zin]. We conclude that for the purposes
of phonetic analysis, information derived regarding syllabic units and fragments
is in fact useful even though for syllable-counting purposes one may desire to
minimize the number of such fragments.

Conclusions 


Syllabic units can be counted in continuous speech by simple automatic 
techniques. The number of syllabic units found will agree relatively reliably 
with a text-derived syllable count under the following conditions: 

1. ""The algorithm is tuned to minimize extra syllabic units and 

missed units by adjusting the significance-threshold d. 

2. A moderate amount of postprocessing is performed to weed out
fricativelike syllabic fragments because they do not constitute
independent syllabic units.


3. Phonological rules are employed to predict where separate words
may be contracted to reduce the syllabic count of the total to
less than the sum of the individual counts.

4. Alternative fluent-production forms are recognized for many
multisyllabic words.

Segmentation into syllabic units appears to be sufficiently consistent so that
the units so delimited constitute appropriate units of the speech signal on
which further analyses may be carried out to extract additional phonetic
information.

ABSTRACT

There is voluminous evidence that homorganic stop consonants are 
distinguishable on the basis of voice onset time (VOT) relative to 
their supraglottal articulation. For initial stops, a convenient 
acoustic reference point is the onset of the release burst, and VOT
has been defined as the interval between this point and onset of
glottal signal. Voice-onset-time boundary values between voiced and
voiceless initial stops of English have been established by
spectrographic measurements of naturally produced isolated words and by
perception testing of synthesized consonant-vowel (CV) syllables. The
close match between the two kinds of boundary values suggests that
fairly natural values were chosen for the invariant features of the
synthetic speech patterns tested. It is known, however, that certain
of these affect voicing perception. New data from synthesis
experiments show that VOT boundaries shift with changes in transition
duration, that the first formant, and not the higher ones, is responsible,
and that transition duration is constrained to values that differ for
place of articulation.

There is a good deal of evidence that homorganic stops are distinguishable
on the basis of voice onset time (VOT) relative to their supraglottal
articulation. For initial stops a convenient acoustic reference point is the
onset of the release burst, and VOT has been defined as the interval between
this point and the onset of glottal signal. Voice-onset-time boundary values
between English initial /b,d,g/ and /p,t,k/ have been determined by
spectrographic measurements of naturally produced isolated words and by
perception testing of synthesized CV syllables. The close match between the two
kinds of boundary values suggests that fairly natural values were chosen for
features of the synthetic speech patterns that were not subjected to
experimental manipulation in the perceptual experiments. It has been known for
some time, however, that certain of these features do affect the perception of
stop voicing. Thus, Cooper, Delattre, Liberman, Borst, and Gerstman (1952)
pointed to a rising first formant as a cue to voicing, and the same group
(Liberman, Delattre, and Cooper, 1958) later singled out first-formant
"cutback" as a formidable cue to the English voiceless stops. More recently,
Fujimura (1971) and Haggard, Ambler, and Callow (1970) have shown that
fundamental frequency can also serve as a cue to the stop-voicing contrast.
Most recently, Stevens and Klatt (1974) have emphasized the role of transition
duration, showing that with greater durations there is an increase in the VOT
value at the boundary between synthetic /da/ and /ta/ syllables. From all these
studies, it is clear that listeners do not attend exclusively to VOT in judging
synthetic stop-vowel patterns. (Here "VOT" signifies the duration of the
interval between burst onset and the time when the periodic signal source is
switched on and the first formant simultaneously shifts from zero to full
amplitude.) Stevens and Klatt (1974) have argued that listeners, at least a
significant proportion of them, respond in categorical manner to the presence
versus absence of rapid frequency shifts in the formants, particularly the
first formant, following the onset of voicing. This not only accounts for their
data, but also, as they point out, serves to explain why VOT boundaries vary
with place of stop closure, since it has been observed that burst and
transition durations also vary with place in natural speech. The Stevens-Klatt
theory emphasizes the fact that there is another temporal landmark, aside from
burst and voicing onset, that may have perceptual importance for stop voicing;
namely, the point where formant frequencies achieve values appropriate to the
following vowel. It might be the case, they seem to be saying, that the choice
of the burst as the reference point for measuring VOT is more a matter of
visual convenience for the spectrogram reader than of selecting the most useful
landmark for the human auditor. At least two questions may be raised here:
(1) Is the Stevens-Klatt hypothesis the only one suggested by their data?
(2) Is their proposed new measure of VOT more nearly sufficient than VOT as a
basis for categorizing the stops?

To help answer these questions, we can examine some new data that show, to
begin with, that the Stevens-Klatt (1974) finding is in fact replicable. In
Figure 1 we have the responses of seven phonetically naive young talkers of
English asked to label as either /da/ or /ta/ a set of appropriately designed
synthetic speech patterns. The variables are VOT and transition duration.
Voice onset time was varied in 10-msec steps from 5- to 65-msec delay in onset
of pulsing and first formant relative to the burst. In Test I six transitions,
ranging from 20 to 85 msec in duration, were used; in Test II the durations
ranged from 40 to 115 msec, this last value being the greatest for which
acceptable /da/ and /ta/ syllables could be heard. Just as Stevens and Klatt
found, the 50 percent crossover points along the VOT dimension move to higher
values with increasing transition duration. The crossover for the shortest
transition tested differs from the one for the longest by slightly more than
25 msec. This shift is just about twice as great as that reported by Stevens
and Klatt. Of course the 25-msec shift shown here is occasioned, as we see, by
a change of 95 msec in transition duration; in the Stevens-Klatt experiment the
transitions were varied by only 30 msec. Insofar as they are comparable, our
data show the closest possible agreement with theirs. However, on the basis of
our data, it is as easy to emphasize the stability of the VOT boundary in the
face of an extreme change in transition duration as it is to point out the
undeniable fact that it is not absolutely immutable.

Figure 1: Transition duration and VOT, /da/ vs. /ta/. Labeling responses of
7 Ss (4 trials) in a forced-choice task. Fifty-six stimuli (7 VOT values x 8
transition durations) with bursts and transitions appropriate to /da/ and /ta/
were presented in two tests to reduce subject fatigue.

Now let us ask again whether these data, and the Stevens-Klatt data as
well, point unequivocally to the duration of the voiced rather than the
unvoiced transition as the feature that determines listeners' labeling
judgments. In Figure 2 we have represented schematically the first-formant
trajectories of our test stimuli. For each transition duration, at VOT = +5,
the first-formant frequency rises linearly from an onset of 154 Hz to a
steady-state value of 769 Hz. Since in general the first-formant intensity is
zero until the periodic source in our synthesizer is turned on, for VOT values
greater than +5 msec, the actual onset frequency of F1 depends directly on VOT.
Thus for a transition of 20-msec duration, the F1 onset frequency at the VOT
crossover value is about 620 Hz. In the display, the F1 trace is a solid line
to the right of the VOT crossover. To the left of that value the dashed line
indicates the absence of acoustic energy at the F1 frequency, while the higher
formants, not shown, are excited by the random noise source of the synthesizer.
We see that with increasing transition duration not only is there a rightward
shift in VOT crossover, but also that there are changes in F1 onset frequency
and in the duration of the transition following onset of the periodic
excitation. These relations are more easily seen in Figure 3. The upper panel
indicates how F1 onset frequency, or alternatively the extent of F1 shift
following voice onset, varies at the VOT boundary with changing transition
duration. Given the limitations of the experiment, the two curves of course say
exactly the same thing, and pending further work we cannot say which measure is
more relevant perceptually, or indeed how much meaning either of them has
independently of the purely temporal measures of voicing. Perhaps we might for
now suppose that as VOT is increased, either the F1 onset frequency must be
lowered or the extent of its frequency shift increased in order to achieve a
stimulus that is ambiguous as between /da/ and /ta/; that is, one or the other
of these changes serves to counterbalance the devoicing effect of prolonging
the delay in voice onset.
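The dependence of F1 onset frequency on VOT described above can be sketched as a small calculation. The 154-Hz and 769-Hz endpoints come from the text; the assumption that the linear rise begins 5 msec after burst onset is an inference from the statement about VOT = +5, and the function names are illustrative only.

```python
F1_START_HZ = 154.0         # F1 at the start of the transition (from the text)
F1_STEADY_HZ = 769.0        # steady-state F1 (from the text)
TRANSITION_START_MS = 5.0   # assumption: the rise begins 5 msec after burst onset


def f1_onset_hz(vot_ms, transition_ms):
    """F1 frequency at the moment voicing, and hence F1 energy, begins."""
    elapsed = vot_ms - TRANSITION_START_MS
    # Clamp to the transition: before it starts F1 is at its lowest value,
    # after it ends F1 has reached the steady state.
    fraction = min(max(elapsed / transition_ms, 0.0), 1.0)
    return F1_START_HZ + fraction * (F1_STEADY_HZ - F1_START_HZ)


def f1_shift_hz(vot_ms, transition_ms):
    """Extent of the F1 rise that remains after voice onset."""
    return F1_STEADY_HZ - f1_onset_hz(vot_ms, transition_ms)


if __name__ == "__main__":
    for vot in (5, 20, 35):
        print(vot, f1_onset_hz(vot, 20.0), f1_shift_hz(vot, 20.0))
```

Under these assumptions a 20-msec transition with voice onset at 20 msec gives an onset near 615 Hz, close to the value of about 620 Hz quoted for the crossover. Onset frequency and remaining shift always sum to the steady-state value, which is why the two curves in the upper panel of Figure 3 carry the same information.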


The lower panel of Figure 3 suggests an answer to the question of whether the
measure of VOT (here labeled "VTD" for "voiced transition duration") proposed by
Stevens and Klatt (1974) provides a more stable index of stop voicing than does
the usual measurement of VOT. If in fact it is true that listeners pay more
attention to the transition following voice onset than to the preceding
voiceless interval, then their measure ought to yield a curve of smaller slope
than the standard VOT measure. This is clearly not the case here. We conclude,
with Stevens and Klatt, that VOT is not alone sufficient to explain our
listeners' behavior, but that VTD, their proposed measure, is even less
adequate, by itself, to account for that behavior.
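The slope comparison can be made concrete with a rough calculation. The sketch below assumes that the transition begins 5 msec after burst onset, so that the voiced transition duration is whatever part of the transition remains after voice onset. The two boundary values are round illustrative figures chosen to match the shift of slightly more than 25 msec over a 95-msec change in transition duration; they are not the measured data.

```python
# Assumption: the F1 transition starts 5 msec after burst onset, so a
# transition of D msec ends at 5 + D msec. This timing is inferred, not stated.
TRANSITION_START_MS = 5.0


def voiced_transition_ms(vot_ms, transition_ms):
    """Duration of the transition that remains after voicing begins (VTD)."""
    transition_end = TRANSITION_START_MS + transition_ms
    return max(transition_end - vot_ms, 0.0)


def boundary_slopes(shortest, longest):
    """Slopes of the VOT and VTD boundaries against transition duration.

    Each argument is a (transition_ms, vot_boundary_ms) pair.
    """
    (d0, vot0), (d1, vot1) = shortest, longest
    span = d1 - d0
    vot_slope = (vot1 - vot0) / span
    vtd_slope = (voiced_transition_ms(vot1, d1)
                 - voiced_transition_ms(vot0, d0)) / span
    return vot_slope, vtd_slope


# Illustrative boundaries only: about 20 msec for the 20-msec transition and
# about 46 msec for the 115-msec transition.
vot_slope, vtd_slope = boundary_slopes((20.0, 20.0), (115.0, 46.0))
print(f"VOT slope {vot_slope:.2f}, VTD slope {vtd_slope:.2f}")
```

With these figures the VOT boundary moves by roughly 0.27 msec per msec of transition duration and the VTD boundary by roughly 0.73. Under the stated timing assumption the two slopes sum to one, so a VOT boundary that shifts by less than half the change in transition duration necessarily implies a steeper VTD curve.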

So far we have been talking about formant transitions as though only the
first formant deserves attention in a discussion of stop voicing. To see
whether this is justifiable, we ran a second experiment in which, along with
VOT, the transition duration of the first formant was varied independently of
the two higher formants. Transition durations of 15 and 80 msec were used, and
these were assigned to our test stimuli as shown on the left-hand side in
Figure 4. The labeling data on the right-hand side of Figure 4 show that
listeners' responses were determined almost entirely by the F1 transition. With
a short F1 transition the effect of varying the higher formants is nil. With
the longer F1 transitions the higher formants have some effect, but that
effect, measured as a shift in VOT crossover, is considerably smaller than the
effect of a change in F1. The effect of transition duration on stop voicing is
then primarily attributable to the first formant.

[Figure; abscissa: duration of transition (msec).] The same data of Figures 1
and 2 are represented in the four curves shown. For the transition durations
tested twice that yielded different VOT crossover values the curves show
overall mean values; the short vertical lines indicate the magnitude of the
differences between Test I and Test II data.

To conclude then, the VOT boundary is not fixed; it varies directly with
the transition duration. However, it is limited, appearing to lie between
limits of roughly 20 and 50 msec following the burst onset. The duration of the
voiced transition at the boundary between /da/ and /ta/ also varies with
transition duration, and our data fail to give any indication of a limiting
value for this feature beyond which /ta/ could not be heard.



REFERENCES

Cooper, F. S., P. C. Delattre, A. M. Liberman, J. M. Borst, and L. J. Gerstman.
(1952) Some experiments on the perception of synthetic speech sounds.
J. Acoust. Soc. Amer. 24, 597-608.

Fujimura, O. (1971) Remarks on stop consonants: Synthesis experiments and
acoustic cues. In Form and Substance: Phonetic and Linguistic Papers
Presented to Eli Fischer-Jørgensen, ed. by L. L. Hammerich, R. Jakobson, and
E. Zwirner. (Copenhagen: Akademisk Forlag).

Haggard, M., S. Ambler, and M. Callow. (1970) Pitch as a voicing cue.
J. Acoust. Soc. Amer. 47, 613-617.

Liberman, A. M., P. C. Delattre, and F. S. Cooper. (1958) Some cues for the
distinction between voiced and voiceless stops in initial position. Lang.
Speech 1, 153-167.

Stevens, K. N. and D. H. Klatt. (1974) Role of formant transitions in the
voiced-voiceless distinction for stops. J. Acoust. Soc. Amer. 55, 653-659.







Some Maskinglike Phenomena in Speech Perception*
M. F. Dorman,+ L. J. Raphael,+ A. M. Liberman,++ and B. Repp+++



ABSTRACT 

To study maskinglike effects in the perception of continuous
speech, listeners were presented two-formant syllables /beb/ or
/beg/ followed, at intervals from 0 to 150 msec, by /de/. The
subjects were instructed to identify the syllable-final consonant. An
80-msec intersyllable interval was required for recognition of the
syllable-final consonant to reach asymptote. To determine the level
at which this "effect" occurred, we repeated the experiment, but with
the second-formant transitions of the first syllable presented alone
for judgment as rising or falling chirps. Recognition of the
isolated transitions (chirps) was essentially unaffected. These data
suggest that the "masking" in the initial study was due to the
elimination of a necessary cue (in this case, a silent interval
corresponding to the stop closure between the syllables) and not to
backward masking of the auditory information. A third study found that
changing voice between the syllables from male to female also
eliminated almost all the "masking." This reinforces the conclusion of
the first two studies, indicating, in this case, that the effect is
not to be attributed to interruption of information processing.

It has been suggested recently that forward and backward masking may
constrain the perception of phonetic segments. Thus, Massaro (1972) has
proposed that "...the redundancy of a vowel in normal speech...protects it from
later speech until processing has been completed" and, in the same spirit, that
"...if a consonant-vowel transition was followed by a speech sound that could
not be integrated with it, perception should be disrupted, and backward
recognition masking should occur."



*This is a revised version of a paper presented at the 89th meeting of the
Acoustical Society of America, Austin, Texas, 7-11 April 1975.

+Also Herbert H. Lehman College of the City University of New York.

++Also University of Connecticut, Storrs, and Yale University, New Haven, Conn.

+++University of Connecticut Health Center, Farmington, Conn.

In this paper we will examine several cases of speech perception that fit
the paradigms for forward and backward masking. Our purpose is to see if the
underlying processes are, indeed, those of masking, either peripheral
(integration of target and mask) or central (disruption of processing).

For the case of forward masking, we have chosen an instance of a type
long familiar to students of speech perception: to perceive the stop in a
syllable-initial fricative-stop cluster, we must have a period of silence
between the fricative noise and the start of the stop transitions. The
syllables in our experiment are /ʃpe/ and /ʃke/. In Figure 1 we have shown a
schematic spectrogram sufficient for the perception of /ʃpe/. Seen in terms
of the forward-masking paradigm, the noise of the fricative /ʃ/ would be the
mask and /pe/ (or /ke/) the target. In our first experiment we varied the
silent interval (let us call it the interstimulus interval or ISI) between
the /ʃ/ mask and the /pe/ or /ke/ targets. The resulting stimulus patterns
were randomized and presented to 14 listeners for judgment as /ʃpe/, /ʃke/,
or /ʃe/. We see in Figure 2 that "masking" did occur: at ISIs of 20 msec or
less the listeners reported hearing /ʃe/, not /ʃpe/ or /ʃke/.

To gain some insight into the processes underlying the failure to hear
the stop, we undertook a second experiment to determine if there was, in fact,
a masking of the essential acoustic cue, the second-formant transition, for
the perceived distinction between /ʃpe/ and /ʃke/. For that purpose, we
followed exactly the procedures of the first experiment except that, in this
case, the targets were not the syllables but only their second-formant
transitions. These isolated transitions are not heard as speech; rather, they
sound like "chirps," which our subjects easily learned to identify as "high"
or "low." The outcome of this second experiment is shown in Figure 3. We see
that correct perception of the "chirps" was not noticeably affected by the /ʃ/
mask. From that we infer that our subjects' failure to hear the stops in the
first experiment was not due to masking in the ordinary sense. That is, the
role of the silent interval between the /ʃ/ noise and the stop transitions is
not, apparently, to avoid interference between target and mask. For a more
reasonable interpretation we should note that in the production of initial
fricative-stop clusters, closure must occur after the fricative, and therefore
we may suppose that the silence caused by the closure is an essential manner
cue for the perception of the stop. On that interpretation, the silence
provides information, not freedom from interference.

Let us turn now to the paradigm for backward masking and in that
connection consider the perception of the disyllables /beb de/ and /beg de/. A
schematic spectrogram sufficient for the perception of /beb de/ is shown in
Figure 4. As a case for backward masking, the syllable-final consonant /b/
in /beb/ is the target and the syllable /de/ is the mask. To determine, then,
whether masking does occur in this case we varied the silent interval between
the mask /de/ and the targets /beb/ or /beg/, randomized the resulting
patterns, and presented them to 13 subjects for judgment as /beb de/, /beg de/,
or /be de/. The outcome is shown in Figure 5. We see that at ISIs of 50 msec
or less the subjects reported hearing /be de/; that is, they did not hear the
syllable-final stops /b/ and /g/.


To find out more about the underlying processes, we carried out for this
paradigmatic case of backward masking an experiment analogous to the
forward-masking experiment with the chirps. That is, we isolated the acoustic
cue for the perceived distinction between /beb/ and /beg/ (the second-formant
transitions that, by themselves, sound like chirps) and, substituting them for
the target syllables /beb/ and /beg/, we added the /de/ mask exactly as it had
been added in the experiment where it had, at short ISIs, effectively masked
the stop-consonant targets. (The subjects were taught to identify the "chirp"
as high or low.) The outcome is shown in Figure 6. We see that the correct
perception of the chirps was little affected by the mask. This suggests that
the failure to hear the stops was not due to masking of the differential
acoustic cue.



Once again we might suppose that, as in the paradigmatic case of forward
masking, the role of the necessary silent interval is to provide information,
not time for processing. As in the earlier case of forward masking, that
supposition is reasonable on the basis that the disyllables /beb de/ and
/beg de/ can be produced only if the speaker closes his vocal tract (thus
producing an interval of silence) between the end of the first syllable and the
beginning of the second. In the case of backward masking, however, there
remains a masking interpretation to which the results with the chirps are not
necessarily relevant: it is possible that the mask (the /de/ syllable)
interrupted the phonetic (as opposed to auditory) processing of the target. It
is difficult to test that hypothesis directly, but the data of the next
experiment do bear on it.

The next experiment was like the backward-masking case just described
except that there were three targets, /bab/, /bad/, and /bag/, followed by a
single mask /da/. The outcome is shown in Figure 7, where we see that the
length of the necessary silent interval is different for the three targets;
/bad/ in particular stands out, needing a much longer silent interval than the
other two. One can think of no reason why the perception of syllable-final /d/
should require so much more processing time than the other stops, which is the
assumption that a masking-process interpretation demands. On the other hand, it
is quite obvious that the normal production of the geminate /bad da/ requires a
longer vocal-tract closure than /bab da/ or /bag da/ (Delattre, 1971). Thus,
this result points once again to the conclusion that silence is here a cue,
a source of information. Moreover, it suggests, more compellingly perhaps
than the earlier experiments, that phonetic perception is in this case
constrained not primarily by what the auditory system can do but by what vocal
tracts can do. The auditory system could hear the chirps even at short ISIs,
but no vocal tract can produce the geminated stops with such short closures.
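One way to read "the necessary silent interval" off an identification function is sketched below: take the shortest ISI from which identification stays at or above some criterion for all longer ISIs. The 90-percent criterion, the ISI steps, and the response percentages are hypothetical values made up for illustration; they are not the data of Figure 7.

```python
def necessary_interval(isis_ms, pct_correct, criterion=90.0):
    """Shortest ISI from which identification stays at or above criterion.

    Walks from the longest ISI toward the shortest and stops at the first
    value that falls below criterion. Returns None if even the longest ISI
    is below criterion.
    """
    needed = None
    for isi, pct in sorted(zip(isis_ms, pct_correct), reverse=True):
        if pct < criterion:
            break
        needed = isi
    return needed


# Hypothetical identification functions (percent correct), for illustration.
ISIS_MS = [0, 20, 40, 60, 80, 100, 120, 150]
HYPOTHETICAL = {
    "/bab/": [10, 30, 70, 92, 95, 96, 97, 97],
    "/bad/": [5, 10, 20, 40, 60, 85, 93, 96],
    "/bag/": [10, 35, 75, 93, 95, 96, 96, 97],
}

intervals = {target: necessary_interval(ISIS_MS, pcts)
             for target, pcts in HYPOTHETICAL.items()}
print(intervals)
```

A per-target summary of this kind makes a difference such as the longer interval needed for /bad/ easy to state as a single number, though the choice of criterion is arbitrary and would need to be reported alongside it.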

On the assumption that perception might here be obeying vocal-tract
constraints, we ask next: Whose vocal tract? Common sense suggests that it
would not be that of the listener or the speaker nor yet of any other
individual, but rather some very abstract conception, which somehow takes
account of what can and cannot be done by vocal tracts in general. In that
connection we note that, as we have already remarked, a vocal tract cannot
produce /bab da/ or /bad da/ without closing between the syllables and, as we
have found, a listener cannot perceive both stops unless there is a
corresponding interval of silence. But that applies only to a single vocal
tract. Given two speakers, one can produce the first syllable and the other the
second syllable with no silent interval between. We thought it of some interest
to determine experimentally if the constraints that applied in the perception
of utterances produced by a single voice would apply equally when perceptibly
different voices produce the target syllable and the mask syllable.

In the last experiment, then, we duplicated the procedures of the earlier
experiment in which we had varied the silent interval between target syllables
/bab/ and /bag/ and a mask /da/, but in one case both syllables were spoken by
a male while in the other the target was spoken by a male and the mask by a
female. Two test sequences were produced, one for the same-voice condition,
the other for the different-voice condition. Within each sequence the patterns
were randomized and presented for judgment of the syllable-final consonant,
/b/ or /g/, as in the earlier experiments. Ten subjects were told before each
condition that the syllables either were or were not produced by the same
voice. The outcome is shown in Figure 8. The solid line (labeled "male") is for
the same-voice condition. We see much the same result that had been found
earlier: at relatively short ISIs the syllable-final consonants are not heard.
The results for the different-voice condition are shown by the dashed lines
(labeled "female"). We see that eight of the listeners in the different-voice
condition correctly perceived the syllable-final consonants, even at the very
shortest ISIs. Two of the listeners performed in the different-voice condition
exactly as they had in the same-voice condition. (It is worthy of note that one
of these two listeners spontaneously commented at the end of the experiment
that she had not thought that the voices were different; she had assumed,
rather, that it was the same person speaking at two different pitches.)

Though many controls need now to be carried out, we shall tentatively
conclude on the basis of the last experiment that if phonetic perception is, in
any case, constrained by what vocal tracts can and cannot do, the constraint is
a very abstract one indeed. From all the experiments here reported, we shall
conclude, somewhat less tentatively, that the role of silence before (or after)
a stop is not to avoid interference (as between target and mask) but to provide
important information. Putting that less tentative conclusion together with
the more tentative one, we might say that the information is important because
it tells the listener what a vocal tract is doing.

The Perception of Vowel Duration in VC and CVC Syllables*

Lawrence J. Raphael,+ Michael F. Dorman,+ and A. M. Liberman++



ABSTRACT 

It is well established that consonant-vowel (CV) transitions
simultaneously convey information about the consonant and the vowel. One
wonders then whether listeners can perceive separately the durations
of consonant and vowel. This question is of some consequence, since
it is known that "vowel" duration is a cue for the perceived
distinction between unreleased voiced and voiceless stops in final position
in the syllable. Our aim in this study was to find out whether
transitions convey information about vocalic duration. In order to
answer this question, three pairs of synthetic stimuli were generated.
One member of each pair was a vowel-consonant (VC) syllable, the
other was a consonant-vowel-consonant (CVC) syllable. All stimuli
ended in transitions appropriate to unreleased /d/, and each
contained formants, appropriate to the vowel /e/, which were varied in
10-msec steps over a durational range of from 30 to 150 msec. The
initial consonants of the CVC stimuli in each of the three
experiments were, respectively, /d/, /r/, and /s/. Listeners were asked to
identify the final consonant as voiced or voiceless (i.e., as /d/ or
/t/). For steady-state vowels of equal duration, the presence of an
initial consonant caused a shift in the perceptual boundary between
the /t/ and /d/ categories, relative to the boundary found for the VC
stimuli. The shift was found to be greater both for a greater
duration of transitional input (e.g., /r/ vs. /d/) and for vocalic
transitions, as opposed to noise (e.g., /r/ vs. /s/). This outcome
suggests that a portion of the initial consonant transition duration is
used by the listener in estimating vowel duration, at least for the
purpose of cuing the voiced/voiceless distinction in final position,
and thus that there is no discrete perception of the durations of the
vowel or of the consonant in the syllable.



*This is a revised version of a paper presented at the 89th meeting of the
Acoustical Society of America, Austin, Tex., 7-11 April 1975.

+Also Herbert H. Lehman College of the City University of New York.

++Also University of Connecticut, Storrs, and Yale University, New Haven, Conn.

Acknowledgment: We acknowledge the assistance of Tony Levas and Suzi
Pollock in the data analysis.




[HASKINS LABORATORIES: Status Report on Speech Research SR-42/43 (1975)]



INTRODUCTION



Many studies have demonstrated the importance of formant transitions of
vowels as cues to the perception of consonants (cf. Liberman, 1957; Liberman,
Cooper, Shankweiler, and Studdert-Kennedy, 1967). Moreover, it has been shown
that formant transitions simultaneously carry a variety of perceptual cues to
consonant identity. For example, consonantal manner information (rate of change
of F1), place-of-articulation information (locus of F2 and F3), and voicing
information (degree of F1 attenuation) are transmitted in parallel on formant
transitions.

Although formant transitions are commonly referred to as "consonant
transitions" because of the nature of the cues they carry, they are,
nevertheless, changes in frequency of the formant structures that characterize
vowels. Thus, formant transitions carry in parallel both consonant and vocalic
information. The purpose of the present study was to investigate to what
extent, if any, transitions of vowel formants contribute information about
vowel duration, and about the duration of the syllable whose nucleus is the
vowel.

EXPERIMENT I

Several studies have shown that vowel duration may be used to cue the
voicing characteristic of consonants in syllable-final position (Denes, 1955;
Raphael, 1972; Raphael, Dorman, Tobin, and Freeman, in press). Varying the
steady-state vowel duration in a synthetic VC syllable, such as /ed/, will
cause listeners to perceive the final consonant as /t/ or /d/: relatively short
steady-state vowel durations cue /t/ judgments; relatively long vowel durations
cue /d/ judgments. If formant transition duration is perceptually accessible
for incorporation with the steady-state portion of the vowel, then the effect of
adding initial consonant transitions to the VC syllable should be to shift the
/t/-/d/ phoneme boundary toward the shorter end of the vowel range; that is, more
final consonants should be identified as voiced.

Method 

The experimental stimuli were produced on the Haskins Laboratories parallel
resonance synthesizer. One set of stimuli consisted of a three-formant vowel,
/e/, followed by 50-msec formant transitions appropriate for the voiced stop
consonant /d/. The stop was synthesized without cues for release. The
steady-state duration of the vowel was varied in 10-msec steps from 30 to
150 msec. Another stimulus series used the VC stimuli as a base, but attached
60-msec formant transitions appropriate for [d] to the beginning of each VC
signal. Six tokens of each of the VC and CVC stimuli were generated. These
stimuli were then randomized, recorded, and reproduced for 12 beginning
phonetics students at Herbert H. Lehman College. The stimuli were delivered
over a loudspeaker in a sound-treated room in the audiology laboratory of the
college. The listeners were instructed to identify the final consonant of each
stimulus as /t/ or /d/.
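The stimulus set described above can be enumerated in a few lines, which makes the size of the test explicit. This is only a sketch: the durations and token count come from the text, while the fixed random seed, the tuple layout, and the decision to shuffle VC and CVC stimuli together in one list are assumptions.

```python
import random

STEADY_STATE_MS = list(range(30, 151, 10))   # 30 to 150 msec in 10-msec steps
FINAL_TRANSITION_MS = 50                     # final /d/ transitions
INITIAL_TRANSITION_MS = 60                   # initial [d] transitions (CVC only)
TOKENS_PER_STIMULUS = 6


def total_duration_ms(kind, steady_ms):
    """Overall duration of one synthetic syllable."""
    initial = INITIAL_TRANSITION_MS if kind == "CVC" else 0
    return initial + steady_ms + FINAL_TRANSITION_MS


def build_test_order(seed=0):
    """Return a randomized list of (kind, steady_ms) stimulus tokens."""
    tokens = [(kind, steady)
              for kind in ("VC", "CVC")
              for steady in STEADY_STATE_MS
              for _ in range(TOKENS_PER_STIMULUS)]
    random.Random(seed).shuffle(tokens)   # seed is an arbitrary assumption
    return tokens


order = build_test_order()
print(len(order), "tokens;", len(STEADY_STATE_MS), "vowel durations per series")
```

Thirteen vowel durations in each of two series, with six tokens apiece, give 156 judgments per listener.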

Results and Discussion 

The percentages of /t/ responses for both the VC and CVC stimuli are shown
in Figure 1. Phoneme boundaries were determined by fitting straight-line
functions to the identification functions for both VC and CVC stimuli. The
difference in phoneme boundaries for the two series was 43 msec. This outcome
indicates that at least some of the duration of the formant transitions
associated with the initial consonant was processed by the listeners as part of
the vowel/syllable duration.

[Figure 1]
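The boundary procedure can be sketched as an ordinary least-squares line fitted to the identification function and solved for the 50-percent point. The text does not say which points entered the fit or what criterion level was used, so both are assumptions here, and the response percentages are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def phoneme_boundary(durations_ms, pct_t, level=50.0):
    """Vowel duration at which the fitted line crosses `level` percent /t/."""
    slope, intercept = fit_line(durations_ms, pct_t)
    return (level - intercept) / slope


# Hypothetical percent-/t/ responses over the sloping part of each function.
vc_boundary = phoneme_boundary([60, 70, 80, 90, 100], [90, 70, 50, 30, 10])
cvc_boundary = phoneme_boundary([20, 30, 40, 50, 60], [90, 70, 50, 30, 10])
print(vc_boundary, cvc_boundary, vc_boundary - cvc_boundary)
```

The boundary shift is then simply the difference between the two fitted crossover points, which is the quantity reported for each experiment.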

EXPERIMENT II

Experiment I revealed that some portion of the initial formant transition
duration may be used to determine vowel duration. The purpose of Experiment II
was to replicate and extend the outcome of the first experiment by assessing the
effect of longer (100-msec) initial formant transitions on the perception of
voicing in syllable-final consonants.

Method

The VC stimuli of Experiment I were also used in Experiment II. In addition,
a series of stimuli were created by attaching 100-msec duration formant
transitions appropriate for /r/ to the beginning of each of the VC stimuli. The
/r/ formant transitions were created by linear interpolation from the starting
resonances to the vocalic resonances. Thus, the /r/ transitions did not contain
initial steady-state resonances. Six tokens of each stimulus were generated.
These stimuli were then randomized and recorded on audio tape. The listeners of
Experiment I also participated in Experiment II. As in Experiment I, the
listeners were instructed to label each stimulus as ending in /t/ or /d/.
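
The linear interpolation used for the /r/ transitions can be sketched as follows. The frequency values and the 10-msec update interval are hypothetical, chosen only to show the shape of such a transition; the report gives no synthesis parameters.

```python
# Sketch of a linearly interpolated formant transition.
# Assumptions: the start and target frequencies and the 10-msec step are
# hypothetical; they are not the synthesis values used in the experiment.

def linear_transition(start_hz, target_hz, duration_ms, step_ms=10):
    """Formant frequencies from start to target, one value per step."""
    n_steps = duration_ms // step_ms
    return [
        start_hz + (target_hz - start_hz) * i / n_steps
        for i in range(n_steps + 1)
    ]


# A 100-msec transition with no initial steady state: the track begins
# moving toward the vocalic resonance from the first frame.
f3_track = linear_transition(start_hz=1600, target_hz=2500, duration_ms=100)
print(f3_track)
```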

Results and Discussion 

The percentages of /t/ responses for the CVC and VC stimuli are shown in
Figure 2. The phoneme boundaries differed by 65 msec. As in the first
experiment, a shorter duration of steady-state vowel was associated with the
/t/-/d/ boundary in the CVC syllables, again suggesting that at least some of
the duration of the transitions for the initial /r/ was incorporated into the
duration of the vowel/syllable.

EXPERIMENT III 

Since the 100-msec /r/ transition caused a greater shift in the phoneme
boundary than did the 60-msec /d/ transitions, we may suppose that listeners are
able to incorporate duration information from transitions in proportion to the
magnitude of the formant transition duration. The purpose of Experiment III was
to determine whether vocalic transitions, that is, resonances that predict the
formant locus of the following vowel, are a necessary condition for the
vowel/syllable lengthening effects of Experiments I and II. To this end,
Experiment III compared the identification of syllable-final /t/-/d/ in VC
stimuli, and in CVC stimuli in which the syllable-initial consonant was /s/.

Method

The VC stimuli used in Experiments I and II were also used in Experiment
III. Vowel durations ranged from 50 to 150 msec in 10-msec steps. In addition,
a series of CVC stimuli were created by attaching, with the aid of the Haskins
PCM system, a 100-msec natural speech /s/ to the beginning of each of the VC
stimuli. As in the previous experiments, six tokens of each CVC and VC stimulus
were generated, randomized, and recorded on audio tape. The listeners were 11
beginning phonetic students at Herbert H. Lehman College. The listening
conditions and task were identical to those in the first two experiments.

Results and Discussion

Figure 3 displays the percentage of /t/ responses for the VC and CVC
stimuli. The phoneme boundary differed between the two conditions by 6 msec.
Although the overall boundary shift was small, 10 of the 11 listeners evidenced
the shift. The magnitude of the boundary shift suggests that the larger
boundary shifts found in Experiments I and II were primarily due to the
incorporation of the formant transition duration into the estimate of vowel
duration, and were not due to increased syllable duration. The small boundary
shift also suggests that well-defined formant structure may be a prerequisite
for the relatively large vowel lengthening effects of Experiments I and II.
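
One way to gauge how consistent a "10 of 11" outcome is would be a sign test. The report does not describe any such test; the sketch below is an illustration only, and it assumes that the eleventh listener showed a shift in the opposite direction rather than no shift at all.

```python
# Illustrative two-sided sign test (not an analysis reported in the source).
# Assumption: every listener shifted in one direction or the other (no ties).
from math import comb


def sign_test_p(n_in_direction, n_total):
    """Two-sided binomial probability under a chance rate of 0.5."""
    k = max(n_in_direction, n_total - n_in_direction)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)


p_value = sign_test_p(10, 11)
print(round(p_value, 4))
```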

CONCLUSION

The results of Experiments I-III indicate that consonant and vowel
durations in CV(C) sequences are not processed separately, but rather that they
are, to a certain extent, combined to give a perceptual estimate of vowel and
syllable durations. It seems clear that durational information in voiced formant
transitions (e.g., /d/ and /r/) is perceptually accessible to listeners for
incorporation into an estimate of vowel duration. Further, our results indicate
that the longer the initial voiced formant transitions, the greater the
lengthening of perceived vowel duration. However, only part of the duration of
the formant transitions is used in the estimate of vowel duration. We infer this
from the fact that the shift in phoneme boundaries in Experiments I and II was
less than the total duration of the formant transitions. It may be that some
minimal duration of formant transition information is necessary for consonant
identification and that the remainder may be processed as part of the vowel.
Such a conclusion, however, awaits the results of further experimentation.
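
The inference that only part of the added duration is used can be made concrete with the figures reported in the three experiments. The sketch below simply divides each boundary shift by the duration of the added initial segment; treating that ratio as the "fraction used" is an interpretive assumption, not a measure defined in the report.

```python
# Boundary shift as a fraction of the added initial segment duration.
# Durations and shifts are the values reported for Experiments I-III;
# reading the ratio as "fraction incorporated" is an assumption.

EXPERIMENTS = {
    "I": {"initial": "/d/", "added_ms": 60, "shift_ms": 43},
    "II": {"initial": "/r/", "added_ms": 100, "shift_ms": 65},
    "III": {"initial": "/s/", "added_ms": 100, "shift_ms": 6},
}


def shift_fraction(added_ms, shift_ms):
    """Boundary shift divided by the duration of the added segment."""
    return shift_ms / added_ms


fractions = {
    name: shift_fraction(e["added_ms"], e["shift_ms"])
    for name, e in EXPERIMENTS.items()
}
for name, value in fractions.items():
    print(name, EXPERIMENTS[name]["initial"], round(value, 2))
```

In every case the ratio is below 1, which is the sense in which only part of the added duration appears in the estimate of vowel duration.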

The effect of the initial fricative in the CVCs of the third experiment,
i.e., a small but consistent shift in the phoneme boundary, is interesting for
several reasons. First, it provides an indication that syllable duration, and
not just vowel duration, is operative as a cue in the final consonant voicing
distinction. The data suggest, however, that the syllable duration cue is
perceptually much less salient than the vowel duration cue: for an equal
duration (100-msec) signal, the formant-poor /s/ provided a very small shift in
the phoneme boundary compared to the formant-rich /r/. Second, we may speculate
that consonant cues conveyed by formants contribute relatively heavily to the
perception of the overall duration of the vowel or syllable, whereas those
contained in noise provide relatively less perceptual input to the estimate of
vowel/syllable duration. Once again, confirmation of such speculation awaits
further research. We are currently assessing the effect of syllable-initial
[h], i.e., noise-excited formants contiguous with the voiced formants of the
following vowel; [tʰ], i.e., transient noise with no formant structure, followed
by [h]-like noise with clear formant structure that is contiguous with the
following vowel; and [m], i.e., well-defined formant structure, but with a
discontinuity between consonantal and vocalic resonances, on the identification
of voicing in syllable-final stop consonants.
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On Accounting for the Poor Recognition of Isolated Vowels* 


Donald Shankweiler, Winifred Strange, and Robert Verbrugge 

ABSTRACT 

Earlier studies have shown that vowels spoken in isolation tend
to be poorly perceived, even when they are produced by phonetically
trained talkers. Listeners, however, generally make remarkably few
errors in identification of vowels in consonant-vowel-consonant (CVC)
environment, even when each syllable is uttered by a different talker.
Sets of nine American English vowels were spoken by a panel of talkers
in isolation and in a /p-p/ environment. Measurements of the
first three formant frequencies were obtained from spectrograms.
Listening tests were made up by randomizing talkers and tokens and
these were presented to phonetically naive listeners. Percent
recognition of the intended vowel (averaged over vowels) was 83 percent for
the /p-p/ condition and 58 percent for the isolated condition. When
the asymptotic formant frequencies of each talker's isolated and
medial vowels are compared, the values are found to be highly similar.
A nontrivial explanation must be sought for the perceptual difficulty 
of isolated steady-state vowels. The data point to the conclusion 
that no single temporal cross section of a syllable conveys as much 
vowel information to a perceiver as is given in the dynamic contour of 
the formants. 

Central to current conceptions of the vowel is the idea of the target. In
articulatory terms the target is a configuration of the vocal tract toward which
the articulators aim. In practice, ideal vowel targets are defined in the
acoustic record by formant frequency values obtained from quasi steady-state
vowels produced in isolation. It is well-known, of course, that in words and 
sentences steady states are rarely attained and that formant frequencies usually 




*Paper presented at the 89th meeting of the Acoustical Society of America, 
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University of Minnesota, Minneapolis. 

University of Michigan, Ann Arbor. 

Acknowledgment: The research reported here is a cooperative endeavor shared
by the Center for Research in Human Learning of the University of Minnesota,
and Haskins Laboratories. It is supported in part by grants to the Center
and to Haskins Laboratories by the National Institute of Child Health and
Human Development.

[HASKINS LABORATORIES: Status Report on Speech Research SR-42/43 (1975)]




vary as a function of time throughout the vowel. Consonantal environment
produces systematic shifts in formant frequencies, causing them to deviate from
target values in direction and amount that is largely predictable from
articulatory considerations (Lindblom, 1963; Stevens and House, 1963). It is
not known how perceivers take account of context-conditioned variation in
perception of vowels. Generally it is assumed that the listener extracts the
target formant values whether or not these are acoustically realized in the
signal. This view encounters a difficulty, however. It has been noted several
times in the literature that isolated vowels tend to be poorly perceived. This
raises the question of whether vowels can be adequately described by a compact
table containing only a single value for each of the first two or three
formants.

In a previous study (Strange, Verbrugge, and Shankweiler, 1974), we
investigated the contribution of consonantal environment to perception of medial
vowels. Nine American English vowels were produced in a number of consonantal
environments and in isolation by a panel of 15 untrained talkers, which included
five children (ranging in age from four to ten), five adult females, and five
adult males. The utterances were recorded on magnetic tape and assembled into a
set of listening tests by randomly mixing the voices from token to token. We
presented the tests to a group of listeners for whom these were novel voices.
Misidentification of isolated vowels occurred with significantly greater
frequency than of medial vowels in a number of consonantal environments. Here we
present the results of perception tests for vowels in /p-p/ environment and in
isolation. An average of 17 percent of the vowel nuclei were not identified as
the talker's intended vowel when they occurred in the /p-p/ frame, and 42
percent were misidentified in isolation. Figure 1 shows a vowel-by-vowel
breakdown of the errors.

It is apparent that the presence of a consonantal environment produced a
consistent facilitation in identification of all nine vowels. Our data are in
agreement with earlier findings of Fairbanks and Grubb (1961), Fujimura and
Ochiai (1963), and Lehiste and Meltzer (1973). We can conclude that isolated
vowels are significantly more often misidentified than medial vowels spoken
under comparable conditions. Could it be that the acoustic complexities
introduced by syllabic structure better serve the requirements of the perceptual
apparatus than do quasi steady-state targets?

Before accepting this conclusion, we must first ascertain whether the
talkers produced isolated vowels with formant frequencies uncharacteristic of
the values reported by earlier investigators. Since isolated vowels do not
typically occur in natural speech, we cannot overlook the possibility that our
talkers, who had no training in phonetics, produced them in peculiar ways that
rendered them relatively unintelligible to the listeners.

The purpose of this study was to investigate that possibility. We undertook
spectrographic analysis of the tokens of isolated vowels and medial vowels
used in our listening tests. Spectrograms of the tokens used in the perceptual
tests were made on a Voiceprint spectrum analyzer. Sections were made at the
point of closest approach to steady state. Center frequencies of F1, F2, and F3
were measured. Tokens uttered by children and women were first rerecorded at
half-speed to facilitate the task of locating the formants. We turn first to the
results for isolated vowels.

In Figure 2, we see F1/F2 plots of each of five tokens of the nine vowels
for the five children. The enclosed areas include all the tokens of a given
type. Asterisks give the average values based on child talkers from the Peterson
and Barney (1952) survey. The Peterson-Barney values make an appropriate
standard of comparison inasmuch as Stevens and House (1963) have demonstrated
that asymptotic formant frequencies of vowels in /h-d/ environment, as employed
by Peterson and Barney, closely approximate those obtained for vowels in
isolation. The figure shows that, with the exception of /æ/ and /o/, our
measurements cluster around the Peterson-Barney average values.

Figures 3 and 4 give the data for adult females and males, respectively.
For all three groups of talkers, when the tokens are segregated by category of
speaker, the vowels separate out well, except for the women's back vowels. The
values for /o/ tend to be displaced in the direction of /a/, reflecting the
predominant dialect of the upper Midwest, which minimizes the /a/-/o/
distinction. But, in general, it is apparent that the target values attained by
our talkers in production of isolated vowels agree rather well with the values
reported by Peterson and Barney (1952) for vowels in an /h-d/ environment. Thus,
we can conclude that our talkers, for the most part, adopted conventional
targets in their productions of isolated vowels.

Figure 5 depicts the vowel spaces for children, women, and men based on
average values of F1 and F2 for each vowel. These vowel polygons show
characteristic differences for the three categories of talkers.

The next step was to compare directly the talkers' vowel spaces for
isolated vowels and for medial vowels in /p-p/ environment. Figure 6 shows that
the spaces are largely congruent. F2 of the medial vowels showed a slight
migration toward the center of the space. This is in accord with results
reported by Stevens and House (1963) for vowels produced between labial
consonants. The effect of the shift is to reduce the acoustic contrast among the
medial vowels relative to the isolated vowels.

Figure 7 displays all the tokens actually used on the listening tests
arrayed in F1/F2 space. Isolated vowels are displayed in black; medial vowels
in gray. Here it is abundantly clear that the sets of vowel tokens produced
under the two conditions of this experiment occupy approximately the same space.
The isolated vowels are, as expected, a little better separated acoustically
than the vowels in /p-p/ environment.

It should be mentioned that for both sets of vowels a proportion of the
tokens deviates markedly from the average values. Before we accept the
conclusion that the inferior intelligibility of isolated vowels cannot be
attributed to aberrant formant values, we should ask whether error rates on
individual tokens can be predicted from their acoustic distance from an average
value. If this is the case, then it becomes important to establish whether the
variability of targets attained for a given vowel is greater for isolated vowels
than for vowels in /p-p/ environment. A full answer to these questions awaits
further study. We know, however, that the relative intelligibility of a token
cannot be estimated very precisely from its position in the space defined by the
two formants, a fact also noted by Peterson and Barney (1952).
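
The notion of a token's acoustic distance from an average value can be sketched as a distance in the F1/F2 plane. The report does not define a metric; Euclidean distance in hertz is assumed here, and the formant values are hypothetical.

```python
# Sketch of "acoustic distance" between a token and an average value.
# Assumptions: Euclidean distance in Hz in the F1/F2 plane; all formant
# values below are hypothetical, not measurements from this study.
from math import hypot


def acoustic_distance(token, reference):
    """Distance between two (F1, F2) points, in Hz."""
    return hypot(token[0] - reference[0], token[1] - reference[1])


reference_vowel = (270, 2360)   # hypothetical average (F1, F2)
observed_token = (300, 2400)    # hypothetical token (F1, F2)
print(acoustic_distance(observed_token, reference_vowel))
```

Because F2 spans a much wider range than F1, an unweighted distance in hertz is dominated by F2 differences; a perceptually motivated scale might be preferred in an actual analysis.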

Likewise, measurements of vowel duration indicate that differences in
durations of medial vowels and isolated vowels fail to account for the
consistent increase in error rate for all isolated vowels. Although the
durations of the latter were considerably longer than vowels in /p-p/
environment, the relative durations of the isolated vowels were the same as for
vowels in /p-p/ position, except for /u/, /e/, and /i/, which were relatively
longer in isolation than in context.

[Figure 2 panel: ISOLATED VOWELS: FIVE CHILDREN. Horizontal axis: first formant
(Hz), 200 to 1100; vertical axis: second formant, 500 to 3000; asterisks mark
the Peterson & Barney averages.]

Figure 2: F1/F2 plots of five tokens of nine English vowels spoken by five
children.

[Figure 3 panel: ISOLATED VOWELS: FIVE WOMEN]

Figure 3: F1/F2 plots of five tokens of nine English vowels spoken by five
adult females.

Figure 4: F1/F2 plots of five tokens of nine English vowels spoken by five
adult males.

Figure 5: Mean F1/F2 points for five tokens of nine English vowels averaged
separately for men, women, and children.

Figure 6: Mean F1/F2 points for five tokens of nine English vowels in /p-p/
environment and in null (/#-#/) environment spoken by 15 talkers,
randomly mixed.

Figure 7: F1/F2 plot of 45 vowel tokens spoken in /p-p/ environment and in
null (/#-#/) environment by 15 talkers.

Having tentatively ruled out a theoretically uninteresting explanation of
the inferior intelligibility of isolated vowels, we must ask what accounts,
then, for the superior intelligibility of vowels in the stop environment.
Surely this is puzzling, given the usual assumptions about the nature
of the acoustic information that specifies vowel quality. An isolated, quasi
steady-state utterance in which the formants attain appropriate targets ought to
be an optimal signal for perception. Indeed, synthetic steady-state vowels
based on these formant parameters are fairly intelligible to listeners.
Moreover, in the domain of automatic speech recognition, some success has been
achieved with a static model of the vowel. Gerstman (1968) devised an algorithm
based on values of F1 and F2 derived from spectrographic measurements of center
formant frequencies of /h-d/ syllables recorded from 76 talkers by Peterson and
Barney (1952). Gerstman's algorithm sorted nine vowels in this set with only
2.5 percent error, less than was made by human listeners. From such a result,
one might infer that target formant frequencies can in principle unambiguously
specify the vowels of English as produced by a variety of talkers. It is but a
short step to the conclusion that a human listener's strategy in identifying
vowels is to extract the target formant values by means of something like a
filter bank.
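
The static view of the vowel can be illustrated by a classifier that assigns a token to the nearest target in the F1/F2 plane. This is not Gerstman's algorithm, whose details are not given here; it is a generic nearest-target sketch, and the target frequencies are illustrative placeholders rather than values taken from this report.

```python
# Generic nearest-target vowel classifier (NOT Gerstman's 1968 algorithm).
# Assumptions: three vowels only, illustrative target values in Hz, and
# squared Euclidean distance as the measure of closeness.

TARGETS = {
    "i": (270, 2290),
    "a": (730, 1090),
    "u": (300, 870),
}


def classify(f1, f2, targets=TARGETS):
    """Label of the target closest to the measured (F1, F2) point."""
    return min(
        targets,
        key=lambda v: (f1 - targets[v][0]) ** 2 + (f2 - targets[v][1]) ** 2,
    )


print(classify(290, 2200))
print(classify(700, 1150))
```

A model of this kind uses one (F1, F2) point per token and no information about how the formants move over time.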

However, as we saw, this conception of the vowel cannot be reconciled
easily with certain facts of perception. Since vowels in isolation were poor
signals from the perceiver's standpoint, even though talkers adopted appropriate
targets (differing little from /p-p/ targets), we can conclude that target
frequencies do not adequately specify a vowel. Cues that we ordinarily regard as
consonantal must contribute to the perception of the vowel. We suspect that
much vowel information is contained in formant transitions, as Lindblom and
Studdert-Kennedy (1967) suggested some time ago. In an analysis of perceptual
adjustments for differences in stress and speaking rate, these investigators
found that vowel identifications varied with direction and rate of transitions
even when the formant frequency values at syllable centers were held constant.
A case for the importance of formant transitions in vowel perception might also
be made from considerations, such as those raised by Liberman (1970), of the
extent of variation in formant contours conditioned by phonetic context. In any
case, we are planning experiments to test the hypothesis directly by studying 
the effects of different consonantal environments, with and without transitions, 
on the perception of a coarticulated vowel. 

Whatever the nature of the contribution consonantal environment makes to
the identification of a vowel, the data point to the general conclusion that no
single temporal cross section of a syllable conveys as much vowel information
to a perceiver as is given in the dynamic contour of the formants. Thus it
would seem that the definition of a vowel, from the standpoint of perception,
ought to include a specification of how the relevant acoustic parameters change
over time. If this conclusion is correct, then the specification of the relation
between sound and percept presents the same problems for vowels as for
consonants.

Some Acoustic Measures of Anticipatory and Carryover Coarticulation*
Fredericka Bell-Berti and Katherine S. Harris



ABSTRACT 

Knowing the anticipatory and carryover limits of coarticulation 
will allow us to specify the constant features of a set of speech 
sounds. Most earlier work has reported on observations of the artic- 
ulator movement or acoustic result of coarticulation. The present 
study is an attempt to define the extent and effects of coarticulation
on the speech acoustic signal. Preliminary results suggest that
carryover effects are more extensive than anticipatory effects. 

INTRODUCTION 

The nature and extent of coarticulation is of central interest to theories 
of speech production. Previous work on this problem, for several languages, has 
shown that anticipatory (or right-to-left) effects are either equal to, or
greater in extent than, carryover (or left-to-right) effects, and that
anticipatory effects may be different in cause from carryover effects (Daniloff
and Hammarberg, 1973).

More specifically, Kozhevnikov and Chistovich (1965) and Daniloff and Moll
(1968) have found anticipatory effects to extend over as many as three phoneme 
segments and across syllable boundaries. These effects have been explained as 
the reorganization of motor patterns for speech segments. Carryover effects, 
on the other hand, have often been attributed to mechanical inertia or articula- 
tor "sluggishness" (Lindblom, 1963; Stevens and House, 1963; Henke, 1966; 
Stevens, House, and Paul, 1966; MacNeilage, 1970), although these effects are 
now sometimes considered to be a deliberate reorganization of speech segments in 
the same way anticipation is a deliberate reorganization (MacNeilage and 
deClerk, 1969; Sussman, MacNeilage, and Hanson, 1973; Ushijima and Hirose, 1974).



*A version of this paper, under the title "Coarticulation in VCV and CVC
Utterances: Some EMG Data," was presented at the 89th meeting of the
Acoustical Society of America, Austin, Tex., 7-11 April 1975.

Also Montclair State College, Upper Montclair, N. J.

Also The Graduate School and University Center of the City University of
New York.

[HASKINS LABORATORIES: Status Report on Speech Research SR-42/43 (1975)] 
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In spite of the central position of coarticulation rules in a general 
theory of speech production, there are very few descriptive data on the relative 
magnitude of anticipatory and carryover coarticulation effects at an acoustic 
level. The experiment to be described was an attempt to fill this gap in our 
knowledge. 

PROCEDURE 



The utterance set contained 18 three-syllable nonsense words, consisting of
a stressed consonant-vowel-consonant (CVC) syllable preceded by [pə] and
followed by [əp]. The vowel in the stressed syllable was either /i/, /a/, or
/u/, and the consonants were /p, t, k/. All combinations of consonants and vowel
were used, except the symmetric ones, giving utterances such as /pəpikəp/,
/pətupəp/, and /pəkatəp/. The utterances were spoken within a carrier phrase,
"Say ___ now," at a conversational rate of speech.
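
The size of the utterance set follows from the design and can be checked by enumeration. The sketch assumes that "symmetric" means the same consonant on both sides of the stressed vowel and that the frame is [pə]CVC[əp]; both are readings of the description above rather than details stated explicitly.

```python
# Enumerate the nonsense-word set implied by the design.
# Assumptions: "symmetric" items are those with identical first and second
# consonants, and every item has the frame [pə] + CVC + [əp].
from itertools import product

CONSONANTS = ("p", "t", "k")
VOWELS = ("i", "a", "u")


def utterance_set():
    """All CVC combinations with differing consonants, in the carrier frame."""
    return [
        f"pə{c1}{v}{c2}əp"
        for c1, v, c2 in product(CONSONANTS, VOWELS, CONSONANTS)
        if c1 != c2
    ]


utterances = utterance_set()
print(len(utterances))
print(utterances[:3])
```

Of the 27 possible combinations, the 9 with matching consonants are excluded, which leaves the 18 utterance types.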

Acoustic recordings were obtained, from one speaker of American English, of
18 repetitions of each of the 18 utterance types.

The audio signal was sampled through the Haskins Laboratories
pulse-code-modulation (PCM) and Spectrum Analyzing Systems, the former for
editing, the latter for generating spectrum data. After software filtering (and
thresholding), hard copies of computer-generated spectrograms were obtained and
formant measurements made off-line.

Since second-formant position is extremely sensitive to back-to-front
tongue position and lip-rounding (that is, front cavity length), F2 measurements
were made at seven points in each repetition of each utterance type. Averages
of 15 to 18 measurements for each sample point were obtained, corresponding to
those tokens included in the EMG average for each utterance type. Schematic
spectrograms of F2 were generated from these averages.

The measurement points were

1. One point in ə1;

2. The beginning, middle, and end points of the stressed vowel;

3. The beginning, middle, and end points of ə2.

No attempt was made to account for durational variation, since the sample
time represented by each data point in the spectrogram is 12.8 msec; hence, the
time scale is too crude for detailed measurements.

RESULTS 

The results are summarized in Figures 1 and 2; the first shows the 18 
utterances plotted with the first consonant held constant; the second shows the 
same data with the second consonant held constant. 

Beginning with the stressed vowel, we see that the initial point is
determined by the preceding consonant, and the end point is affected by the
following consonant. The effects of the terminal consonant on the midpoint of
the stressed vowel are not as large as those of the initial consonant. The
midvowel position is more often determined by the initial consonant. In other
words, the carryover effect of the first consonant on the stressed vowel is
larger than the anticipatory effect of the second.

We can also examine the relative magnitudes of anticipatory and carryover
effects by looking at the effects of environment on the initial and terminal
schwa vowels. One-step effects are seen in both directions: the initial schwa
is affected by the following consonant, while the second schwa is affected by
the preceding consonant. However, when we turn to the two-step effects, we find
that the initial schwa is not affected by the following vowel, while the same
vowel does change the value of the following schwa. In general, then, at the
acoustic level, carryover effects are larger than anticipatory effects. It is
this asymmetry of effect that must be accounted for at an articulatory level.
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